Text Mining to Understand Disease-Causing Gene Variants

Nezamuldeen, Leena; Jafri, Mohsin Saleet

doi:10.3390/knowledge4030023

Open Access Review

by Leena Nezamuldeen ^1,2 and Mohsin Saleet Jafri ^1,3,*

¹ School of Systems Biology, George Mason University, Fairfax, VA 22030, USA
² King Fahd Medical Research Centre, King Abdulaziz University, Jeddah 21589, Saudi Arabia
³ Center for Biomedical Engineering and Technology, University of Maryland School of Medicine, Baltimore, MD 21201, USA
* Author to whom correspondence should be addressed.

Knowledge 2024, 4(3), 422-443; https://doi.org/10.3390/knowledge4030023

Submission received: 15 March 2024 / Revised: 1 July 2024 / Accepted: 7 August 2024 / Published: 19 August 2024

Abstract

Variations in the genetic code for proteins are considered to confer traits and underlie disease. Identifying the functional consequences of these genetic variants is a challenging endeavor. There are online databases that contain variant information. Many publications have also described variants in detail. Furthermore, there are tools that allow for the prediction of the pathogenicity of variants. However, navigating these disparate sources is time-consuming and sometimes complex. Text mining and large language models offer promising approaches to understanding the textual form of this knowledge. This review discusses these challenges and the online resources and tools available to facilitate this process. Furthermore, a computational framework is suggested to accelerate and facilitate the process of identifying the phenotype caused by a particular genetic variant. This framework demonstrates a way to gather and understand the knowledge about variants more efficiently and effectively.

Keywords: genetic variants; artificial intelligence; variant prediction; large language models

1. Introduction

Variations in the genetic code, known as genetic variants, underlie population diversity, as seen in traits and diseases. The advancement and feasibility of whole exome sequencing (WES) techniques have played a significant role in identifying genetic variants associated with diseases. Genetic variants are thought to alter protein structure and thereby alter protein function. Therefore, analyzing genetic variants to discover their effects on protein structure and function, which can result in disease development, is a critical research topic [1].

Genetic variations are changes in the DNA sequence (mutations) that can result in changes in the protein amino acid sequence. Protein structure and function can be drastically altered by genetic variations with harmful effects. Structurally, proteins are made up of chains of amino acids, and these chains serve a variety of functions ranging from sustaining the protein’s basic structure to acting as a functional area on the protein’s surface. Extracellular and intracellular proteins interact in an organized dialogue to instruct all the biological pathways for cell proliferation and survival. The impact of a genetic variant on protein function is dictated by the type of protein, its function, and the exact position of the variant. The variant’s position may be crucial for protein function. If the genetic mutation is situated in an important functional feature, such as the active site (the place where a molecule binds a protein and undergoes a chemical reaction), a binding site (the place where a molecule or ion contacts the protein), or a domain (a functional unit of a protein), where it modifies the structure of these sites, or if it changes the core of the protein’s structure, it can have significant effects. On the other hand, a variant located elsewhere might cause a structural shift that affects the orientation or accessibility of these sites. The genetic mutation not only affects the translated protein’s structure and function, but it can also inhibit its interaction with other proteins, possibly resulting in the onset of disease.

1.1. Background

Genetic variants are a fundamental aspect of human biology, influencing traits, disease risk, and responses to the environment. Advances in genetic research and technology, such as whole genome and whole exome sequencing, are continuously improving our understanding of these variants, paving the way for personalized medicine and new therapeutic approaches. Genetic variants can be categorized by various factors, such as time of onset (de novo or inherited), information content (coding, regulatory, or non-coding), frequency (rare or common), complexity (monogenic or polygenic), and inheritance pattern [2]. Nonsynonymous genetic mutations have a much greater impact on the structure and function of proteins compared to synonymous mutations. Nonsynonymous genetic variants can be classified into multiple classes, including missense mutations, nonsense mutations, insertion mutations, deletion mutations, duplication mutations, frameshift mutations, and repeat mutations. Genetic variants with deleterious effects is the term used to describe these types of mutations in the DNA sequence. It is important to identify and analyze the functional effects of nonsynonymous mutations on proteins in order to enhance our understanding of how genetic variation might contribute to reducing or increasing the risk of developing complex diseases and disorders. For example, a recent genetic study found that two rare APOE gene variants, V236E (in the APOE ε3 isoform) and R251G (in the APOE ε4 isoform), are associated with a significantly lower risk of developing Alzheimer’s disease (AD). This association was observed in a large sample size from multiple cohorts and considered the age of onset for the disease [3]. The risk reduction associated with these variants suggests that protein chemistry and functional assays of these variants should be studied to guide drug development targeting APOE. 
Research into discovered genetic variants linked with the etiology of autism spectrum disorder (ASD) is vast. However, this disorder remains incurable at present. One study discovered that 13 ASD patients had distinct genetic variants in four genes (BRSK2, ITSN1, FEZF2, and PAX5). Despite the fact that all patients shared these four genes, the genetic variant locations and phenotypic symptoms varied among them [4], necessitating the need to further investigate the functional effects of these variants on biological pathways (the network of biochemical reactions that underlie cellular function) in each of the four proteins. Another study conducted whole exome sequencing on 253 individuals with autism spectrum disorder (ASD), including 68 with intellectual disability (ID) and 90 with Asperger’s syndrome. It was found that CHD8, SHANK3, NRXN1, and KMT2A each had a different genetic variant present in multiple patients from either the ASD/ID group or the Asperger’s group [5]. These genes and their genetic variants are all significant contributors to the genetic heterogeneity of ASD, and their study can provide valuable insights into the underlying mechanisms of the disorder to find a precise treatment. Another study in China screened for TBK1 variants in the largest cohort of amyotrophic lateral sclerosis (ALS) patients and identified eight missense variants and one splice site mutation in the TBK1 gene. The clinical manifestations of TBK1 carriers are diverse, expanding the genotypic spectrum of ALS patients with TBK1 variants [6]. The results of this study highlight the significance of genetic analysis in understanding the different clinical features of ALS patients with TBK1 mutations and have the potential to help with the discovery of targeted therapy options. Proteins play crucial roles in various biological processes and pathways. 
Understanding how genetic variants change protein structure and function and can affect the dynamics of biological pathways is key to developing targeted therapies [7]. Analyzing the effects of specific protein genetic variants can reveal key points of disruption in biological pathways that lead to disease phenotypes. It can also predict the downstream effects of pathogenic genetic variants on cellular processes and physiological states, determine if a protein with genetic variant has lost its normal function or gained a novel function, visualize causal relationships between a protein with genetic variant and its downstream effects, and assess whether bypass processes exist to mitigate the effects of a genetic variant [8]. Additionally, it can help prioritize de novo mutations associated with complex diseases; predict how genetic variants affect properties such as protein stability, flexibility, and binding energies [9,10]; and prioritize targets for therapeutic intervention.

1.2. Motivation

Studying protein variants within the context of biological pathways is essential for elucidating disease etiology, identifying therapeutic targets, and designing targeted treatment strategies. Integrating genetic variant data with biological pathway knowledge using computational methods is an effective approach to achieve understanding. Computational approaches leverage our growing understanding of cellular pathways and networks to bridge the gap between genetic associations and disease phenotypes. For example, integrating GWAS data with biological pathway and molecular network information can strengthen the identification of trait-associated genes [11]. Furthermore, analyzing the effects of specific protein variants on pathway dynamics by prioritizing these variants for further investigation based on their predicted impact on cellular processes and comparing them to physiological states can help identify key points of disruption in pathways that contribute to disease. Using pathway-guided computational methods, such as machine learning models that incorporate pathway information, can improve the understanding and predictive accuracy of disease models in comparison with analyzing genes individually [12]. The motivation to develop computational frameworks for identifying the phenotype caused by a particular genetic variant is driven by the need to understand the functional impact of genetic variations on biological pathways efficiently and accurately, leading to more precise diagnosis and treatment of diseases. The abundance of genetic databases, genetic variant databases, protein databases, biological pathways databases, drug databases, and the PubMed database has been steadily increasing over time. However, integrating the information from all of these databases for studying the effect of a single genetic variant on protein structure and function, as well as its subsequent role in disease development, is a time-consuming process. 
To address this challenge, there is an urgent need to develop an automated search approach that can replace the manual curation process and help discover the effects of genetic variants on disease development. The method proposed in this review aims to streamline this process by helping to identify and contain the number of studies that should be reviewed to understand the effect of a genetic variant on proteins. In our previous work in creating protein–protein interaction (PPI) networks using text mining [13], the step of mining these studies was completed, thereby uncovering the effects of genetic variants on the development of diseases.

1.3. Objectives

The automated search approach proposed in this review can significantly reduce the time and resources required to gather information about the impact of genetic variants on protein structure, function, and disease development. This approach can efficiently extract, normalize, and integrate relevant information from various databases, thereby providing a more comprehensive understanding of the effects of genetic variants. In addition, automated approaches can improve accuracy and consistency by recognizing, disambiguating, and normalizing genetic variant information, leading to more accurate and consistent results. The key objectives of this automated search approach are to streamline the integration of diverse genetic and biomedical data, improve the accuracy and consistency of variant interpretation, and accelerate the translation of genetic insights into improved disease understanding and therapeutic development. As the volume of genetic and biomedical data continues to grow, the automated search method proposed and developed in this review in combination with our previously proposed PPI network extraction approach can easily scale to handle the increasing information load and be iteratively adapted to incorporate new data from database sources. This allows for a more efficient and effective way to understand the effects of genetic variants on various diseases.

2. Literature Review

2.1. Online Resources

Many online resources are important for understanding how genetic variants impact function and give rise to disease or traits. Online databases hold vast quantities of information cataloging variants, their functional implications, pathway involvement, and references supporting these data. Because the effects of many of the variants that have been discovered are unknown, online tools have been developed to predict pathogenicity. More recently, text mining methods and large language models have been developed that aid in the extraction of knowledge from large volumes of textual data.

2.1.1. Databases

Several major databases catalog genetic variants. The National Center for Biotechnology Information (NCBI) dbSNP database is the first basic reference to find information on genetic variants and their effects (https://www.ncbi.nlm.nih.gov/snp/, accessed on 12 March 2024) [14]. LitVar [15] is an NCBI web application used to retrieve relevant information for a genetic variant, including diseases and drugs, using text mining methods (https://www.ncbi.nlm.nih.gov/research/litvar2/, accessed on 12 March 2024). ClinVar is another free database that contains information on different genetic variants and their disease-causing potential and effect on drug responses with supporting evidence (https://www.ncbi.nlm.nih.gov/clinvar/, accessed on 12 March 2024). The Online Mendelian Inheritance in Man (OMIM) database provides information on human genes and their genetic variants (https://www.omim.org/, accessed on 12 March 2024). The European Molecular Biology Laboratory (EMBL) also supports a variant database called the European Variant Archive (https://www.ebi.ac.uk/eva/, accessed on 12 March 2024).

Of the many other databases that contain valuable knowledge, some are freely available, and some require a subscription. Some of these databases are disease specific. For example, for cancer variants, knowledge about cancer biomarkers and drug resistance biomarkers can be studied using resources such as The Cancer Genome Atlas (TCGA) maintained by the National Cancer Institute (NCI) (https://www.cancer.gov/ccg/research/genome-sequencing/tcga, accessed on 12 March 2024), The Jackson Laboratory Clinical Knowledge Base (JAX-CKB) (https://www.jax.org/clinical-genomics/ckb, accessed on 12 March 2024), and The Sanger Institute’s Catalogue Of Somatic Mutations In Cancer (COSMIC) database (https://cancer.sanger.ac.uk/cosmic, accessed on 12 March 2024).

There are also databases that provide information about protein structure and function. Uniprot (https://www.uniprot.org/, accessed on 12 March 2024) is a database that has information about protein structure, function, sequence, and variants and their phenotypes. The Protein Data Bank (PDB) is a database of the structures of proteins, peptides, and proteins bound to ligands (https://www.rcsb.org/, accessed on 12 March 2024). The PDB also contains structures of proteins with genetic variants.

These databases collect and combine data on genetic variation from research and publications. They also offer access to resources that help comprehend the biological and clinical importance of genetic variants. They vary in terms of their extent, techniques of selection and organization, and models for representing information. Although these databases are considered excellent resources for genetic variant information, a substantial proportion of the genetic variants documented in these databases are classified as variants of uncertain significance (VUS). The phenotypic effect or illness associated with these polymorphisms has not been definitively confirmed. Evaluating the pathogenicity of a genetic mutation requires supporting information from various sources, including computational predictions, classification data, functional tests, and case–control comparisons. In the absence of these supporting data, the majority of novel missense variants, rare variants, and variants that have been re-categorized as benign are categorized as VUS. Continued research and data sharing are necessary to classify VUS.

2.1.2. Variant Classification Tools

Previously developed web tools have demonstrated significant progress in discovering the effects of genetic variants. These include SIFT v6.2.1 [16], PROVEAN v1.0 [17], Polyphen-1, Polyphen-2 [18], MutationAssessor v0.75, FATHMM v2.3 (Functional Analysis through Hidden Markov Models), PhD-SNP v2.0.7, SNAP v1.0, PANTHER v19.0, Auto-Mute v2.0, PLINK v2.0, CC/PBSA v1.0, AlphaMissense v1.0 [19], and I-Mutant v2.0 [20]. Several consensus tools, such as PON-P v2.0, Condel v2.0, and PredictSNP v2.0, have been created to combine multiple tools to obtain a better prediction of pathogenicity using the default settings. Many of these tools start with the initial structure from the Protein Data Bank and allow the user to specify a list of single variants. After the submitted job is queued, the predictions are returned. The individual tools often give mixed results, which is why the consensus tools were developed. Other methods have been developed to overcome these limitations. The molecular dynamics phenotype prediction model (MDPPM) method applies machine learning to feature sets obtained from molecular dynamics simulations (such as the phi and psi dihedral angles of the amino acid backbone) to obtain highly accurate predictions of the phenotype of protein variants [21,22]. Moreover, MDPPM can also predict the effect of multiple variants in the same protein, differentiate between different phenotypes (e.g., diseases), and make quantitative predictions about severity. However, this method is computationally intensive and requires expertise in machine learning and molecular simulations.
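The idea behind combining several predictors can be sketched as a simple majority vote over per-tool calls. This is only an illustration under stated assumptions: the function name `consensus_call`, the two labels, and the example calls are hypothetical, and published consensus tools may weight or calibrate the individual scores rather than take a plain vote.

```python
from collections import Counter

def consensus_call(predictions):
    """Return the majority label among per-tool predictions, or 'uncertain' on a tie.

    predictions: dict mapping a tool name to 'deleterious' or 'benign'.
    """
    if not predictions:
        return "uncertain"
    ranked = Counter(predictions.values()).most_common()
    # A tie between the two most frequent labels means the tools disagree evenly.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "uncertain"
    return ranked[0][0]

# Hypothetical calls for one missense variant (not real tool output).
example = {"SIFT": "deleterious", "PROVEAN": "deleterious", "PolyPhen-2": "benign"}
print(consensus_call(example))
```

Returning an explicit "uncertain" label on a tie keeps disagreement between tools visible instead of hiding it behind an arbitrary choice.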

RegulationSpotter v1 [23] is a web-based tool for annotating genetic variants in extragenic regions of the DNA. CUPSAT v1.0 [24], MutPred v2.0 [25], PMut v1.0 [26], AUTO-MUTE v2.0 [27], FoldX v4.0 [28], CC/PBSA v1.0 [29], RosettaCommons [30], I-Mutant v2.0 [31], Hunter v1.0 [32], and many other advanced web tools assess the effect of missense variants on the stability of protein structures. The freely available web tools can predict the impact of a genetic mutation, but they cannot specify the consequences and importance of that impact for other biological elements or biological pathways. It is crucial to integrate the distinctive features of all the previously mentioned web tools to develop a workflow that can analyze the impact of genetic variants.

2.1.3. Pathway Analysis Tools

Biological pathways play a crucial role in the organization and regulation of metabolic processes in living organisms. They facilitate the efficient processing of molecules; control the activity of crucial enzymes; integrate various signals; synchronize numerous cellular functions; and play a role in cellular survival, evolution, and differentiation [33]. Understanding these biological pathways can help in developing selective treatments to modulate pathway activities for various medical conditions. Multiple biological pathway databases are accessible. Each provides researchers with invaluable resources for investigating and assessing diverse biological pathways. The Kyoto Encyclopedia of Genes and Genomes (KEGG v111.0) is an extensive database that provides a deep understanding of biological systems at the molecular level. It includes data on pathways, diseases, drugs, and genomes. KEGG pathways are manually curated and consist of visual representations of molecular interactions [34]. Reactome v89 is a database that offers manually curated pathways for different biological processes, such as metabolism, signaling, and gene expression, with comprehensive pathway diagrams and annotations [8]. BioGRID v4.4.233 collects experimental data on protein–protein interactions, genetic interactions, and post-translational modifications from the literature and high-throughput experiments, using manual curation to ensure accuracy and quality, assigning confidence scores, and providing contextual information [35]. Pathway Commons combines data from various pathway and interaction databases, including Reactome, KEGG, BioGRID, and others, promoting community participation and contribution to enhance coverage and accuracy of pathways and interactions [36]. BioPAX v3.1 is a web service of Pathway Commons v11 [37]. 
For all these databases, additional manual curation is necessary to examine a specific genetic variant that appears in a domain or other functional location in the protein and to determine its effect on disease development.

It is essential to determine the specific biological pathway to which a protein with a genetic mutation is related, as well as to identify the other proteins that interact with it. Elsevier’s Pathway Studio is a bioinformatics software platform designed to analyze and visualize biological pathways, gene regulation networks, and protein–protein interaction networks. Ingenuity Pathway Analysis (IPA v24.0.1) is a software tool used to analyze biological pathways and networks. IPA, developed by Qiagen, is a sophisticated software package that allows for the analysis and interpretation of omics data (such as gene expression, proteomics, and metabolomics) within the framework of biological pathways, networks, and diseases. The software offers tools for visualizing data, conducting enrichment analysis, and performing predictive modeling. Cytoscape v3.10.2 is an open-source software used for the visualization and analysis of molecular interaction networks [38]. Although Cytoscape is not primarily intended for pathway analysis, it may be enhanced with many plugins and apps that provide pathway enrichment analysis, network visualization, and integration with other bioinformatic resources. Although these tools are designed for pathway analysis, they are too costly and require a training course to help researchers in their investigations. Biological pathways can be viewed as networks of protein–protein interactions. The pictures drawn in the shape of protein–protein interaction networks show the biological pathway activities with no quantitative details such as in kinetic systems. The simplest method for analyzing these types of networks is the Boolean network model. In these models, the nodes take the label of binary values (active or inactive, 1 or 0) and the edges express the relations between the nodes with logical operators, depending on logical functions defined between the nodes [39]. 
These models can provide kinetic information such as the steady state of each node in the system, which can offer insights into the average activation levels of the proteins in the biological pathways. This kinetic information can be used later in advanced kinetic systems to support the discovery of the cause of the disease. In addition, these steady states can be compared before and after the perturbation of any node to indicate the effect of genetic mutations [40]. The SPIDDOR (Systems Pharmacology for efficient Drug Development in R) package for analyzing Boolean networks in R has the advantage of allowing the activity of the nodes in the network to be changed, which introduces mutation-like activity [41].
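The Boolean network idea described above, including the comparison of steady states before and after a node perturbation, can be sketched in a few lines. This is a minimal illustration and not the SPIDDOR API (SPIDDOR is an R package): the toy pathway, the node names, and the helper functions `step` and `steady_state` are all hypothetical, and the sketch assumes a simple synchronous update scheme.

```python
def step(state, rules, fixed=None):
    """Apply every rule once (synchronous update); `fixed` pins nodes to a value."""
    new_state = {node: bool(rule(state)) for node, rule in rules.items()}
    new_state.update(fixed or {})
    return new_state

def steady_state(initial, rules, fixed=None, max_steps=100):
    """Iterate until the state stops changing; return None if no fixed point is reached."""
    state = dict(initial)
    state.update(fixed or {})
    for _ in range(max_steps):
        nxt = step(state, rules, fixed)
        if nxt == state:
            return state
        state = nxt
    return None

# Hypothetical toy pathway: a signal activates a kinase, which activates a target
# unless an inhibitor is present. Edges are expressed as logical functions.
rules = {
    "signal": lambda s: s["signal"],        # external input keeps its value
    "inhibitor": lambda s: s["inhibitor"],  # external input keeps its value
    "kinase": lambda s: s["signal"],
    "target": lambda s: s["kinase"] and not s["inhibitor"],
}
start = {"signal": True, "inhibitor": False, "kinase": False, "target": False}

wild_type = steady_state(start, rules)
# Pinning the kinase to OFF mimics a loss-of-function variant.
knockout = steady_state(start, rules, fixed={"kinase": False})
print(wild_type["target"], knockout["target"])
```

Comparing `wild_type` and `knockout` node by node shows which downstream activities change when one protein is perturbed, which is the kind of comparison the text describes.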

This R package was utilized in our study [42] to investigate the effect of genetic variants in seven proteins causing autism in four patients. The seven proteins are DDX26B/INTS6L, USP9X, RPS6KA6/RSK4, FAF5, FLNA, IDS, and SUMF1. After creating the protein–protein interaction network, a convergence between mTOR and Wnt pathways was discovered. The SPIDDOR tool transfers the protein–protein interaction network to the Boolean network model and analyzes it. The advantage of using the SPIDDOR tool to introduce mutation-like activity to the proteins with genetic variation was important to compare the severity effect of these mutations on the other proteins’ activities in the network. Furthermore, it highlighted the defects in the biological pathways and where to focus in treating this patient by reversing the defect occurring in the biological pathways and improving the developed traits. Overall, the use of Boolean network modeling is recommended because of the lack of the expression profile of these patients and the numbers needed to measure the severity of the effect of the mutated genes on the phenotype appearance. In summary, the Boolean network modeling serves as a supportive tool for complex biological networks that lack expression or kinetic information.

2.2. Text Mining

Text mining, or text analytics, refers to the extraction of valuable information and insights from unstructured text. Because of the rapid expansion of digital information, text mining has gained significant importance in different fields, including information retrieval, sentiment analysis, entity recognition, summarization, and other applications [43]. An example of an entity recognition method is a study that combines text mining and structure-based analysis to predict protein functional sites. Researchers used dynamics perturbation analysis (DPA) to predict functional sites at control points where interactions significantly disrupt protein vibrations. Researchers used the entity recognition method to extract residue mentions from abstracts of protein structure papers. The method was based on defining patterns such as residue full name, one-letter code or three-letter code, and the position of the amino acid in the sequence. This approach significantly helps bridge the growing gap between knowledge captured in the literature and what is available in structured databases, especially for proteins with unknown functions or limited experimental data [44].
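A pattern-based residue extractor of the kind described above can be sketched with regular expressions. This is an illustrative sketch, not the cited study's implementation: the function name `residue_mentions` and the exact patterns are assumptions, and a pattern this simple will produce false positives (for example, the word "his" followed by a number).

```python
import re

ONE = "ARNDCQEGHILKMFPSTWYV"
THREE = "Ala Arg Asn Asp Cys Gln Glu Gly His Ile Leu Lys Met Phe Pro Ser Thr Trp Tyr Val".split()
FULL = ("alanine arginine asparagine aspartate cysteine glutamine glutamate glycine "
        "histidine isoleucine leucine lysine methionine phenylalanine proline serine "
        "threonine tryptophan tyrosine valine").split()

# Map every spelling (lower-cased) to the one-letter code.
TO_ONE = {name.lower(): code for names in (THREE, FULL) for name, code in zip(names, ONE)}

# Full name or three-letter code, an optional space or hyphen, then the position.
NAMED = re.compile(r"\b(" + "|".join(FULL + THREE) + r")[ -]?(\d+)\b", re.IGNORECASE)
# One-letter code immediately followed by the position, e.g. "H57".
SHORT = re.compile(r"\b([" + ONE + r"])(\d+)\b")

def residue_mentions(text):
    """Return (one_letter_code, position) pairs found in text, sorted by position."""
    found = {(TO_ONE[m.group(1).lower()], int(m.group(2))) for m in NAMED.finditer(text)}
    found |= {(m.group(1), int(m.group(2))) for m in SHORT.finditer(text)}
    return sorted(found, key=lambda pair: (pair[1], pair[0]))

sentence = ("The catalytic triad comprises His57, Asp-102 and serine 195; "
            "mutation of S195 abolishes activity.")
print(residue_mentions(sentence))
```

Normalizing all three spellings to one-letter codes lets repeated mentions of the same residue collapse into a single entry.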

Another study demonstrates the potential of integrating text mining and PPI analysis to evaluate the effects of combined chemotherapeutic and chemopreventive agents in cancer therapy. Their results suggest that this approach can help identify effective combinations of agents and potential therapeutic targets, which can be validated through in vitro experiments [45]. The text mining approach they used was embedded in the Pathway Studio software v12.3 of Elsevier, which has a built-in natural language processing (NLP) module called MedScan, one that analyzes sentence semantics and lexical structure to identify biological entities like proteins, molecules, functional classes, cell processes, and diseases.

The tmVar v3.0 software is a text mining tool that uses a conditional random fields (CRF) model and regular expression patterns to extract a wide range of genetic variants described in the biomedical literature. The creators of this NCBI tool used the Human Genome Variation Society (HGVS) format standards for writing genetic variations in articles to define their regular expression patterns [46]. The CRF model is used to extract the genetic variations from the articles, while the regular expression patterns ensure that the extracted text conforms to HGVS nomenclature standards. By integrating the two methods, they minimize false negative results and achieve high performance as a text mining tool that can automatically extract and normalize a wide range of genetic variants from the biomedical literature. Both techniques are considered entity recognition methods in text mining.
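The regular-expression half of such an approach can be illustrated for one narrow case, protein-level substitutions. The sketch below is not tmVar's pattern set and omits the CRF component entirely; the function name `protein_substitutions` and the pattern are assumptions, and only simple forms such as V236E or p.Arg251Gly are covered.

```python
import re

ONE = "ARNDCQEGHILKMFPSTWYV"
THREE = "Ala Arg Asn Asp Cys Gln Glu Gly His Ile Leu Lys Met Phe Pro Ser Thr Trp Tyr Val".split()
TO_THREE = dict(zip(ONE, THREE))

AA3 = "|".join(THREE)
VARIANT = re.compile(
    r"\b(?:p\.)?(?:"
    r"(" + AA3 + r")(\d+)(" + AA3 + r")"       # three-letter form, e.g. Val236Glu
    r"|([" + ONE + r"])(\d+)([" + ONE + r"])"  # one-letter form, e.g. V236E
    r")\b"
)

def protein_substitutions(text):
    """Return HGVS-style strings (e.g. p.Val236Glu) for each substitution found."""
    results = []
    for m in VARIANT.finditer(text):
        if m.group(1):  # three-letter form matched
            ref, pos, alt = m.group(1), m.group(2), m.group(3)
        else:           # one-letter form matched; expand to three-letter codes
            ref, pos, alt = TO_THREE[m.group(4)], m.group(5), TO_THREE[m.group(6)]
        results.append(f"p.{ref}{pos}{alt}")
    return results

text = "Two rare APOE variants, V236E and p.Arg251Gly, were associated with lower AD risk."
print(protein_substitutions(text))
```

Mapping both spellings to one normalized string is what allows mentions from different articles to be grouped under the same variant.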

A survey article explored various deep learning models used in motif mining. It categorized these models into three types: convolutional neural network (CNN), recurrent neural network (RNN), and hybrid CNN-RNN-based models. The paper noted that classic models like DeepBind v1.0 [47], DeepSNR v0.9.0 [48], DeepSEA [49], Dilated [50], DanQ [51], BiRen [52], KEGRU [53], and iDeepS v2.0 [54] are used for tasks like identifying DNA, RNA, or protein binding sites and understanding gene regulation and management, and it compared their performance [55]. The models can be viewed as sentiment analysis tasks in text mining, where the datasets are comprised of segments of genomic sequences. These segments are labeled based on the k-mer or the length of the segmented sequences, allowing for the classification of genomic features into distinct categories. We developed a method for text mining biomedical abstracts to generate protein–protein interaction networks [13]. The model’s architecture is based on three approaches. The initial approach was sentiment analysis, followed by named entity recognition and pattern definition using a shortest dependency path model built with spaCy v3.7. The three models were sufficient to extract protein interactions in an organized visualization using NetworkX v3.3. The sentiment analysis model achieved 95% precision on the AIMed/BioInfer corpus, and the named entity recognition model also achieved 98% precision on the AIMed/BioInfer corpus.
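Segmenting a genomic sequence into k-mers, so that it can be handled like a sequence of word tokens, can be sketched as follows. The function name `kmers` and its `stride` parameter are illustrative assumptions and are not taken from any of the cited models.

```python
def kmers(sequence, k, stride=1):
    """Split a sequence into k-mers taken every `stride` positions."""
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    sequence = sequence.upper()
    # Sequences shorter than k yield an empty list.
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]

print(kmers("ACGTAC", 3))
```

With `stride=1` the k-mers overlap; setting `stride=k` gives non-overlapping segments instead.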

Text mining has a wide range of applications in genomics. Not only can it be used to extract information from biomedical literature, but it also can be used to analyze epigenetic data and identify patterns and trends in gene regulation and expression. Furthermore, text mining can integrate data from various ‘omics’ disciplines, such as genomics, transcriptomics, and proteomics, to understand complex biological processes and disease mechanisms. These applications highlight the significant role text mining plays in the genomic field, enabling researchers to extract valuable insights from large volumes of data and streamline their workflows.

2.3. Large Language Models

Large language models (LLMs) are a type of generative artificial intelligence model that generate human-like text at a scale that surpasses human language capabilities. These models, trained on vast text data and using advanced machine learning techniques, have various applications, including text generation, language understanding, knowledge extraction, content filtering and moderation, and personalization [56]. BioBERT v1.1, BioGPT, and recently BioNeMo v1.7 have already been trained on massive biomedical corpora, such as PubMed abstracts and articles, to capture domain-specific knowledge and terminology. BioBERT was developed by researchers at Korea University on the basis of Google’s BERT [57], BioGPT was developed by Microsoft [58], and BioNeMo was developed by NVIDIA. Pre-training enables these models to learn contextual representations of biomedical text, which can be adjusted for specific downstream tasks such as creating an artificial intelligence (AI) model for biomedical named entity recognition, relation extraction, question answering, etc. STRING DB v9.1 is a protein–protein interaction database that utilizes BioBERT and now RoBERTa, a larger language model, in its text mining evidence-based model for protein relations extraction [59]. The deep learning model achieves a twofold increase in the number of protein–protein interactions across low, medium, and high confidence score cut-offs compared to its previous text mining models. BioGPT has been employed in numerous research studies to train models for novel findings in specific domains, such as age-related diseases [60], drug–drug interactions [61], microbiome–disease relations [62,63], and inter-bacterial associations. NVIDIA BioNeMo is a framework created by NVIDIA specifically designed for training and implementing sophisticated conversational AI models in the field of biomedicine. It offers pre-trained models and datasets specifically designed for tasks involving the analysis of biological language. 
The system utilizes NVIDIA GPUs to accelerate the process of training and interpretation, making it well-suited for managing extensive biomedical datasets. Also, it provides the capability to construct a generative AI model for drug development, combined with other frameworks such as target identification, 3D protein structure prediction, and docking. The implementations encompassed in this approach included the creation of functional proteins, analysis of protein structure, fighting of infectious diseases, and process of drug discovery [64,65,66,67].

AI chatbot systems are expanding rapidly. AI chatbots use NLP and LLMs to analyze user inputs and respond accordingly. They can be integrated with other systems to provide more comprehensive support and can be used for tasks such as writing emails, generating code, creating images, summarizing text, and generating information. Various platforms and tools are available for building and using chatbots, including Perplexity AI, OpenAI’s GPT, Microsoft’s Copilot, Google’s Gemini, Meta’s LLaMA, Anthropic’s Claude, and Mistral AI’s models; some of these provide an AI chatbot connected to the internet and offer features like prompt suggestions and footnotes with sources. Chatbot programs vary in cost, with some being entirely free and others requiring monthly subscriptions. Among the free options, Microsoft’s Copilot is a strong choice; it uses OpenAI’s GPT-4 language model, making it proficient in various language tasks. Perplexity AI, on the other hand, tends to give longer answers to complex questions but requires a monthly subscription fee. LLMs face several challenges, including bias and fairness, ethical implications, safety and security, explainability and transparency, resource consumption and environmental impact, generalization and contextual understanding, data privacy and confidentiality, continuous learning and adaptability, and deployment difficulties. Bias in training data can lead to unfair results, and detecting and mitigating these issues is crucial. Ethical considerations are necessary for content generation, deepfakes, misinformation, and societal impacts. Ensuring LLMs do not generate harmful information is essential, and improving model explainability and transparency is crucial for trust and understanding. Addressing the environmental footprint of LLMs while maintaining performance is a growing challenge. 
Developing strategies for continuous learning in LLMs and addressing practical challenges in real-world scenarios are also essential for successful implementation. These challenges highlight the complexity and multifaceted nature of working with LLMs, emphasizing the importance of addressing these issues to harness their full potential while mitigating associated risks and limitations. Overall, LLMs have the potential to significantly enhance various industries such as education, healthcare, decision making, productivity, customer experience, and more and improve human interaction with technology.

3. Examples and Case Studies

3.1. Autism Spectrum Disorder as an Example of the Implications of Genetic Variants

Many diseases are caused by dysfunctional proteins that arise from genetic variants. From this perspective, we annotated the genetic variants reported in [68] to discover their effects on protein function and on the development of autism spectrum disorder (ASD) in certain patients. The researchers analyzed 19 Saudi Arabian families with autistic children using whole exome sequencing and discovered 47 uncommon variants in 17 of the 19 families. Several variants were found in the dbSNP database. A total of 10 of the 47 variants lie in genes previously identified as autism candidate genes in other studies, and five genes (GLT8D1, HTATSF1, OR6C65, ITIH6, and DDX26B) carried de novo variants. The researchers used the Broad Institute’s GATK v4.0 variant caller, ANNOVAR v2024Jun17, and QIAGEN’s Ingenuity Pathway Analysis (IPA) software v24.0.1 to annotate and filter the list of genetic variations and to identify shared biological pathways among them. The IPA software lists the proteins and their common biological pathways without indicating how the variants affect protein function or those pathways. According to our manual annotation of four of the de novo genes (HTATSF1, OR6C65, ITIH6, and DDX26B), these mutations were positioned in key regions of the protein structure. We located each variant and carefully investigated its functional location on the protein by reviewing various public protein databases and previous studies of the protein. Our manual approach to searching for and tracking each variant yielded promising results in determining how the variant contributes to the disease, but the manual curation was time-consuming. While manually tracking each variant, we identified a systematic procedure. As a result, we aimed to develop an automated search method, based on this procedure, for large databases related to proteins and their genetic variants, including the PubMed database. The automated approach searches for and identifies the natural characteristics and functions of the region where the genetic variant is located in order to provide insights into how this variant contributes to disease development, which might lead to the discovery of an appropriate treatment.

To identify the position of the genetic variant on the protein and assess the natural features of that region, information on the region’s structure and function should be gathered from the UniProt database [69], post-translational modification (PTM) databases, protein domain databases, and protein structure databases to create an information base. This information base will generate precise keywords for retrieving relevant literature, which aids in understanding the impact of the genetic variant on protein structure and function and on the development of the disease.

3.2. The Automated Curation

A methodology was developed to streamline the curation process for a specific genetic variant in order to identify the natural characteristics of the area inside the protein where it is situated and anticipate its impact on the protein’s function (Figure 1).

Figure 1 illustrates the automated curation process, starting with entering the protein name, genetic variant (e.g., F258L), and mutation type. The data entry is broken into five pieces: the protein name; the first amino acid; the genetic variant’s position in the protein sequence; the second amino acid; and the type of mutation, such as missense, nonsense, insertion, deletion, duplication, or frameshift. The second amino acid entry accepts all amino acids as well as symbols for the other mutation types (X for a nonsense mutation, ins for an insertion mutation, del for a deletion mutation, delin for an indel mutation, fs for a frameshift mutation, dup for a duplication mutation, and multi AA if several amino acids are affected by the mutation). Once this information is entered, the variant position is checked against the UniProt and iPTMnet databases to see if it is located in a crucial functional area of the protein. The UniProt database was selected for its detailed information on the ranges of amino acid sequences corresponding to functional areas of proteins, such as domains, repeats, and zinc fingers, which are clearly specified on each protein’s webpage. The iPTMnet database was used due to its focus on post-translational modifications (PTMs) to determine if the variant is a PTM site or located in proximity to a PTM site. The next two sections explain in detail how the two databases are curated.
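The five-part data entry described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the names `Variant` and `parse_variant` are hypothetical, the mutation type is inferred from the suffix symbol rather than entered by the user, and the "multi AA" case and synonymous changes are not handled.

```python
import re
from typing import NamedTuple

ONE_LETTER = "ACDEFGHIKLMNPQRSTVWY"

# Suffix symbols listed in the text, mapped to a mutation type.
TYPE_BY_SYMBOL = {
    "X": "nonsense",
    "ins": "insertion",
    "del": "deletion",
    "delin": "indel",
    "fs": "frameshift",
    "dup": "duplication",
}


class Variant(NamedTuple):
    """Hypothetical container for the five pieces of the data entry."""
    protein: str
    ref_aa: str
    position: int
    alt: str
    mutation_type: str


_PATTERN = re.compile(r"^([A-Z])(\d+)([A-Za-z]+)$")


def parse_variant(protein: str, variant: str) -> Variant:
    """Split a string such as 'F258L' or 'M256fs' into its parts."""
    match = _PATTERN.match(variant.strip())
    if match is None:
        raise ValueError(f"Unrecognized variant format: {variant!r}")
    ref_aa, position, alt = match.groups()
    if ref_aa not in ONE_LETTER:
        raise ValueError(f"Unknown amino acid code: {ref_aa!r}")
    if alt in TYPE_BY_SYMBOL:
        mutation_type = TYPE_BY_SYMBOL[alt]
    elif len(alt) == 1 and alt in ONE_LETTER:
        mutation_type = "missense"
    else:
        raise ValueError(f"Unknown second entry: {alt!r}")
    return Variant(protein, ref_aa, int(position), alt, mutation_type)


print(parse_variant("HTATSF1", "F298L"))
print(parse_variant("OR6C65", "M256fs"))
```

Keeping the parsed position as an integer makes the later range checks against database features a simple comparison.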

3.3. UniProt Database Curation

Protein databases are commonly linked by the UniProt ID, so the starting point for finding information about the genetic variants related to a certain protein is the protein’s primary reference database, UniProt. Many public databases provide an Application Programming Interface (API) that allows users to access, use, and manipulate the stored data. To access UniProt programmatically, the requests [70] and BeautifulSoup [71] libraries were used in Python. Requests is an HTTP client library that is used to access the database websites, and BeautifulSoup is used to parse the websites in HTML or XML format. Accessing a specific UniProt entry using the requests library requires creating an appropriate query for this entry, which can be challenging. We have the protein name and the genetic variant we need more information about, including its position on the protein. We aim to address the question of whether this genetic variant is situated in a functional region of the protein. The query we created consisted of the user-provided protein name, the organism ID for human (9606), the requirement for the protein to be reviewed (reviewed: true), and the desired output format as XML. After retrieving the output of the query as an XML object, the BeautifulSoup library was used to extract information about the protein’s function, accession number, other names (short and long), and synonyms. The UniProt database contains a specific section on each protein page that provides a comprehensive list of functional regions, along with their corresponding positions on the protein sequence. We used these ranges to determine if the previously entered position of the genetic variant falls within any of these functional regions of the protein. If the genetic variant resides inside a functional region corresponding to a domain, the search is extended to include the Conserved Domain Database (CDD) and the Prosite database in order to determine the name and function of this domain. 
The requests and BeautifulSoup libraries were also used to retrieve the information from the CDD and Prosite databases, but here the query contained the protein accession number extracted previously from UniProt.
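The query construction and range check described above might look roughly like the sketch below. It is an illustration under stated assumptions rather than the authors' code: it uses the standard-library `xml.etree.ElementTree` and `urllib.parse` in place of the libraries named in the text so that it runs without extra packages, the function names are hypothetical, and `SAMPLE_XML` is a made-up, heavily simplified stand-in for a UniProt XML response (its accession and coordinates are not real data).

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

UNIPROT_SEARCH_URL = "https://rest.uniprot.org/uniprotkb/search"


def build_uniprot_url(protein_name: str) -> str:
    """Combine protein name, human organism ID, reviewed flag, and XML format."""
    query = f"({protein_name}) AND organism_id:9606 AND reviewed:true"
    return UNIPROT_SEARCH_URL + "?" + urlencode({"query": query, "format": "xml"})


def _local(tag: str) -> str:
    """Drop the XML namespace prefix from a tag name."""
    return tag.rsplit("}", 1)[-1]


def functional_regions(xml_text: str):
    """Return (type, description, begin, end) for every ranged feature."""
    regions = []
    for elem in ET.fromstring(xml_text).iter():
        if _local(elem.tag) != "feature":
            continue
        begin = end = None
        for child in elem.iter():
            name = _local(child.tag)
            if name == "begin":
                begin = child.get("position")
            elif name == "end":
                end = child.get("position")
        if begin and end:  # skip single-position or uncertain features
            regions.append(
                (elem.get("type"), elem.get("description"), int(begin), int(end))
            )
    return regions


def regions_containing(position: int, regions):
    """Keep only the regions whose range includes the variant position."""
    return [r for r in regions if r[2] <= position <= r[3]]


# Made-up example data, for illustration only.
SAMPLE_XML = """<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <accession>X00000</accession>
    <feature type="domain" description="Example domain">
      <location><begin position="100"/><end position="200"/></location>
    </feature>
    <feature type="zinc finger region" description="Example zinc finger">
      <location><begin position="250"/><end position="275"/></location>
    </feature>
  </entry>
</uniprot>"""

regions = functional_regions(SAMPLE_XML)
print(build_uniprot_url("HTATSF1"))
print(regions_containing(150, regions))
```

In practice the URL would be fetched with an HTTP client and the response text passed to `functional_regions`; a hit whose type is a domain would then trigger the follow-up lookups in CDD and Prosite.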

3.4. iPTMnet Database Curation

Post-translational modification (PTM) of proteins involves the attachment of functional groups to specific sites on the proteins. These modifications impact the protein’s function and its involvement in cellular activities such as cell signaling, metabolism, and gene expression. Major PTMs include phosphorylation, acetylation, methylation, ubiquitination, and sumoylation. Mutations at PTM sites can contribute to diseases such as cancer, neurodegenerative disorders, and metabolic syndromes. It is crucial to understand the mechanisms and functions of mutated PTM sites in order to explain defects in cellular processes and develop targeted therapies for human diseases. From this perspective, determining whether the genetic variant entered earlier is at or near a PTM site is important for assessing the variant’s impact on disease development. The iPTMnet [72] database is a leading PTM resource. The requests and BeautifulSoup libraries were used to access and retrieve the information from this database. The query used to access this database contained the protein accession number extracted from the UniProt database earlier.
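Once the PTM sites for a protein have been retrieved, the "at or near a PTM site" check reduces to a distance comparison. The sketch below is illustrative only: the function name is hypothetical, the text does not define how close "near" is, so the `window` of five residues is an assumption, and the example sites are invented rather than taken from iPTMnet.

```python
def classify_ptm_proximity(position, ptm_sites, window=5):
    """Split PTM sites into those at the variant position and those nearby.

    ptm_sites is a list of (site_position, ptm_type) tuples.
    window is an assumed cut-off in residues; the text gives no value.
    """
    at_site = [(p, t) for p, t in ptm_sites if p == position]
    nearby = [
        (p, t)
        for p, t in ptm_sites
        if p != position and abs(p - position) <= window
    ]
    return {"at_site": at_site, "nearby": nearby}


# Invented sites for illustration.
example_sites = [(120, "phosphorylation"), (123, "acetylation"), (300, "ubiquitination")]
print(classify_ptm_proximity(120, example_sites))
print(classify_ptm_proximity(200, example_sites))
```

An empty result for both keys corresponds to the case where the PTM-related data list stays empty.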

3.5. Arranging the Information

After collecting essential information about the genetic variant and its position in the protein from the UniProt, iPTMnet, CDD, and Prosite databases, five data lists were generated from the automated search results. The first data list is a collection of protein names in various formats, such as short names, long names, and synonyms. The second data list consists of domain names in various formats if the genetic variant is situated within a protein domain. The third data list is a list of region names in various forms, such as zinc finger, transmembrane, and repeat, if the genetic variant is present in such a region. The fourth data list contains the genetic variant presented in several ways, indicating the altered amino acids in one-letter code, three-letter code, or full name, together with the phrase (mutations OR substitutions). For example, F349L can be expressed as F349L, Phe349Leu, or Phenylalanine349Leucine. The fifth data list includes the genetic variant presented in various formats together with terms such as phosphorylation, ubiquitination, or acetylation, if the genetic variant is located in or near a PTM site. The terms in the five data lists are combined with the logical operator AND to generate a list of search terms, which are then used in the PubMed database to retrieve the literature and abstracts related to the protein and its genetic variant.
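The variant formats and the AND combination described above can be sketched as follows. This is a simplified illustration: `variant_forms` and `and_terms` are hypothetical names, only missense-style substitutions are covered, and the exact set of pairings the authors generate may differ.

```python
from itertools import product

# One-letter code -> (three-letter code, full name)
AMINO_ACIDS = {
    "A": ("Ala", "Alanine"), "R": ("Arg", "Arginine"),
    "N": ("Asn", "Asparagine"), "D": ("Asp", "Aspartic acid"),
    "C": ("Cys", "Cysteine"), "Q": ("Gln", "Glutamine"),
    "E": ("Glu", "Glutamic acid"), "G": ("Gly", "Glycine"),
    "H": ("His", "Histidine"), "I": ("Ile", "Isoleucine"),
    "L": ("Leu", "Leucine"), "K": ("Lys", "Lysine"),
    "M": ("Met", "Methionine"), "F": ("Phe", "Phenylalanine"),
    "P": ("Pro", "Proline"), "S": ("Ser", "Serine"),
    "T": ("Thr", "Threonine"), "W": ("Trp", "Tryptophan"),
    "Y": ("Tyr", "Tyrosine"), "V": ("Val", "Valine"),
}


def variant_forms(ref: str, position: int, alt: str):
    """Write a substitution in one-letter, three-letter, and full-name form."""
    ref3, ref_full = AMINO_ACIDS[ref]
    alt3, alt_full = AMINO_ACIDS[alt]
    return [
        f"{ref}{position}{alt}",
        f"{ref3}{position}{alt3}",
        f"{ref_full}{position}{alt_full}",
    ]


def and_terms(left, right):
    """Pair every term in one data list with every term in another."""
    return [f"{a} AND {b}" for a, b in product(left, right)]


print(variant_forms("F", 349, "L"))
# ['F349L', 'Phe349Leu', 'Phenylalanine349Leucine']
print(and_terms(["HTATSF1"], ["F298L", "RRM"]))
# ['HTATSF1 AND F298L', 'HTATSF1 AND RRM']
```

Because each pairing multiplies the list sizes, the number of search terms grows quickly when a protein has many synonyms or region names.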

3.6. Knowledge Retrieval

There is a vast amount of research on proteins and genetic variations associated with ASD. This neurological disorder is also difficult to cure, and there are only two FDA-approved treatments (Risperidone and Aripiprazole). We were interested in the de novo genes with genetic variants (HTATSF1, OR6C65, ITIH6, and DDX26B) from the study [68]. Searching for specific scientific literature on certain proteins to understand their structure and function, as well as discovering the effect of a genetic variant on disease development, is a challenging task. Before the automated search across the protein databases was developed, manual curation was performed on a limited amount of research on these proteins. Applying the automated approach enhanced the search terms, which increased the number of studies retrieved. This suggests that further insights can be obtained from these studies, leading to a more comprehensive understanding of the impact of the genetic variant on disease development. Examples of results for these genes are described in Appendix A.
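One way to submit the generated search terms to PubMed programmatically is through the NCBI E-utilities `esearch` endpoint. The text does not state how the authors queried PubMed, so the sketch below is an assumption: the helper names are hypothetical, the sample responses are invented (placeholder PMIDs, trimmed to the fields used here), and merging identifiers across terms is an illustrative choice.

```python
import json
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term: str, retmax: int = 20) -> str:
    """Build a PubMed esearch URL that returns JSON."""
    params = {"db": "pubmed", "term": term, "retmode": "json", "retmax": retmax}
    return ESEARCH_URL + "?" + urlencode(params)


def parse_esearch(payload: str):
    """Return (total hit count, list of PMIDs) from an esearch JSON response."""
    result = json.loads(payload)["esearchresult"]
    return int(result["count"]), list(result["idlist"])


def merge_pmids(payloads):
    """Collect the unique PMIDs returned across several search terms."""
    seen = set()
    for payload in payloads:
        _, ids = parse_esearch(payload)
        seen.update(ids)
    return sorted(seen)


# Invented responses with placeholder PMIDs, for illustration only.
SAMPLE_A = '{"esearchresult": {"count": "2", "idlist": ["00000001", "00000002"]}}'
SAMPLE_B = '{"esearchresult": {"count": "2", "idlist": ["00000002", "00000003"]}}'

print(build_esearch_url("HTATSF1 AND F298L"))
print(parse_esearch(SAMPLE_A))
print(merge_pmids([SAMPLE_A, SAMPLE_B]))
```

Merging by PMID avoids counting the same article more than once when several search terms retrieve it.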

4. Discussion

Genetic variations are significant in biology, influencing genetic diversity, traits, and diseases. When annotating genetic variants, it is crucial to consider two primary pieces of information: the specific gene in which the variant is located, which helps identify the biological context and potential functional implications, and the precise location of the variant within the gene, which is essential for understanding the variant’s potential impact on protein structure and function. Together, these provide a solid foundation for annotating genetic variants, enabling a better comprehension of their effects on biological pathways and disease development. Genetic variant databases are the main repositories for human genetic variations. Yet, many genetic variant databases retain their own annotations and datasets, resulting in discrepancies in the data [73]. Furthermore, the underrepresentation of certain populations in genomic research and reference databases limits the diversity of these databases. The resulting differences in variant interpretation, especially among individuals of different ancestry, can reduce the informativeness of genetic testing results for particular ethnic groups [74]. Variants with unknown functional impact, which these databases classify as VUS, pose a further challenge for interpretation.

In this paper, we reviewed some tools used for predicting the pathogenicity of genetic variants, as they are essential for the diagnosis and management of genetic disorders. Most of these tools are developed using machine learning approaches with structure-based modeling or sequence-based methods. One challenge is that these tools may yield varying results, impacting the consistency of variant interpretation across databases. Many factors can affect the precision of these tools, such as the quality of the training data, the complexity of the method, the validation process, and the continuous updating of the method [75]. By attending closely to these factors, researchers can improve the reliability of variant pathogenicity tools and use them to enhance the diagnosis and management of genetic disorders.

Integrating genetic variation data with biological pathway knowledge provides a valuable approach to elucidating the functional effects of protein variants. Biological pathway databases are extensive collections of manually curated and annotated pathways that represent molecular interactions, reactions, and relationships within biological systems. These databases are essential for comprehending biological processes and their relationships to diseases. Updating scientific and clinical databases with the latest information from the ever-expanding scientific literature is challenging due to the vast volume of data. Therefore, the development and enhancement of computational tools that automate parts of the literature search to consolidate and present knowledge are crucial for scientists, offering efficiency and facilitating access to updated information. Text mining approaches can help overcome this challenge by analyzing the huge volume of unstructured text in research articles. NLP, deep learning, LLMs, and the combination or layering of various algorithms have been demonstrated to boost accuracy in text mining methods. The text mining tools and methods reviewed in this paper are valuable for evaluating the effect of genetic variants on protein structure and function and on disease development. The primary challenge lies in the fact that tools such as deep learning, NLP, or LLM-based models generate text, requiring users to manually collect and integrate the information to form a comprehensive picture. This process can be time-consuming and laborious. Two notable exceptions are STRING and Pathway Studio. STRING is a free tool, but it lacks the precision of Pathway Studio, which provides more accurate results but is not free, making it less accessible to some users.

A tool that combines the automated curation of databases about specific genetic variants in specific proteins, such as the method we developed in this paper, with a text mining method that extracts information about the protein from biomedical articles to create a PPI network would expand knowledge about the effect of the genetic variant on disease development by leveraging curated databases as well as PubMed. By querying multiple databases and integrating the results, such a tool not only reduces the time users spend searching through the literature but also provides a comprehensive and reliable summary of the available information. However, this area still requires further research and refinement to achieve optimal performance. Many complex diseases, such as autism spectrum disorder, Alzheimer’s disease, and ALS, are influenced by multiple genes. Existing tools are insufficient in addressing this complexity. To create knowledge by mining literature and databases, tools must be designed to account for this multifactorial nature. New methods need to be developed to uncover the intricate relationships between genes and their contributions to these complex diseases.

5. Conclusions

An extensive understanding of the genetic mutations associated with diseases, such as neurodegenerative diseases, and of their impacts on protein function is still being developed in the scientific community, building on significant previous efforts such as the automated curation method and the tools reviewed in this paper. Genetic variations play a crucial role in biology, influencing genetic diversity, traits, and diseases. Accurate annotation of genetic variants currently requires manual curation of genomic, protein, and pathway databases. Large language models and other advances in text analysis have the potential to reduce the burden of manually curating these databases.

This article demonstrates that knowledge discovery methods help to clarify how the genetic variants reported in [68] give rise to the autism phenotype by affecting protein structure and, as a result, protein function. These findings were supported by previous significant studies on HTATSF1, OR6C65, ITIH6, and DDX26B. The automated curation method example in this paper produced results consistent with labor- and time-intensive deep manual curation. Our results were promising, and we aimed to draw a clear picture of how these variants affect protein function and cause autism in order to guide the use of available medicines that could serve as targeted treatments for these individuals and others who may share the same mutations, or to shape new clinical trial opportunities. Combining an automated genetic variant mining approach with a text mining approach to create a PPI network would help reveal how genetic mutations alter protein structure and function to cause disease. Future work lies in integrating our automated data mining approach with the PPI network creation we developed previously, so as to present a new web tool that can predict the effect of a specific genetic variant on protein structure and function in the development of a disease.

Author Contributions

Conceptualization, L.N. and M.S.J.; methodology, L.N.; software, L.N.; validation, L.N. and M.S.J.; formal analysis, L.N.; investigation, L.N.; data curation L.N.; writing—original draft preparation, L.N. and M.S.J.; writing—review and editing, L.N. and M.S.J.; supervision, M.S.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data from this study are available by contacting the authors.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Appendix A.1. HTATSF1

HIV TAT Specific Factor 1 (HTATSF1) is a cofactor that promotes HIV-1 transcription. The genetic variant is a missense variant that causes an amino acid substitution from phenylalanine to leucine at position 298 (F298L) of the translated protein’s sequence. Using the automated search example shown in Figure 1, the information extracted from the UniProt database includes the protein function, accession number, full and short names, and the region in which the genetic variant is located. Because the automated search found that the variant lies in the RNA binding domain (RBD), the search was extended to the CDD and Prosite databases to extract other names for the protein’s domain. The five data lists created were as follows:

data list 1: [‘Tat-SF1’, ‘Tat’, ‘HTATSF1’].

data list 2: [‘RRM’, ‘RNA recognition motif (RRM) domain’, ‘binding RNA’].

data list 3: [‘F298L’, ‘F-to-L (mutations OR substitutions)’, ‘Phenylalanine to Leucine (mutations OR substitutions)’].

data list 4: [‘UAF homology motif (UHM)’, ‘17S U2 SnRNP complex component HTATSF1’].

data list 5: [].

The fifth data list was empty because the variant is not a PTM site. Finally, a comprehensive data list was generated by combining all the terms from the four data lists using the logical operator AND:

[‘RRM Structure’, ‘Tat-SF1 AND RRM’, ‘Tat AND RRM’, ‘HTATSF1 AND RRM’, ‘HTATSF1 AND F298L’, ‘HTATSF1 AND F-to-L (mutations OR substitutions)’, ‘HTATSF1 AND Phenylalanine to Leucine (mutations OR substitutions)’, ‘RNA recognition motif (RRM) domain Structure’, ‘Tat-SF1 AND RNA recognition motif (RRM) domain’, ‘Tat AND RNA recognition motif (RRM) domain’, ‘HTATSF1 AND RNA recognition motif (RRM) domain’, ‘binding RNA Structure’, ‘Tat-SF1 AND binding RNA’, ‘Tat AND binding RNA’, ‘HTATSF1 AND binding RNA’, ‘UAF homology motif (UHM) AND RRM’, ‘UAF homology motif (UHM) AND RNA recognition motif (RRM) domain’, ‘UAF homology motif (UHM) AND binding RNA’, ‘17S U2 SnRNP complex component HTATSF1 AND RRM’, ‘17S U2 SnRNP complex component HTATSF1 AND RNA recognition motif (RRM) domain’, ‘17S U2 SnRNP complex component HTATSF1 AND binding RNA’, ‘Tat-SF1’, ‘Tat’, ‘HTATSF1’, ‘RRM’, ‘RNA recognition motif (RRM) domain’, ‘binding RNA’, ‘UAF homology motif (UHM)’, ‘17S U2 SnRNP complex component HTATSF1’]. The provided terms served as search terms that were used in the PubMed database to retrieve every work associated with this specific protein, domain, and genetic variant. As a result, the number of studies after using these search terms in the PubMed database was 15,362.

For comparison, the manual curation began with a search of the UniProt database for information on the HTATSF1 protein. This search revealed that position 298 falls within the range of the RNA binding domain. Because protein domains are conserved and usually function independently of the rest of the protein, we searched the PubMed database for the terms ‘RNA binding domain AND Phe to Leu substitution OR mutation’ to see if any previous research had been done on this domain. Notably, we found an article describing the same phenylalanine to leucine substitution in the RNA binding domain of the FUS protein. The article listed four phenylalanine to leucine substitution positions (F305L, F341L, F359L, and F368L). In the Drosophila model of amyotrophic lateral sclerosis (ALS) and neuronal cell lines, these substitutions were sufficient to inhibit FUS protein binding to the RNA molecule [76]. HTATSF1 and FUS are both RNA binding proteins and share a conserved RNA recognition motif domain and function. The RNA binding domain is made up of four beta-sheets connected by two alpha-helices. The loops between the beta-sheets and alpha-helices, specifically their aromatic residues, are also involved in the recognition and binding of the RNA sequence [77]. According to the HTATSF1 structure in the PDB database, the phenylalanine is an aromatic residue located in the loop between the first alpha-helix and the second beta-sheet. We performed a pairwise structure alignment to determine which of the four phenylalanine residues (F305L, F341L, F359L, F368L) in the RNA binding domain of the FUS protein aligned with the phenylalanine (F298L) in the RNA binding domain of the HTATSF1 protein. The RNA binding domain polypeptide sequence and structure in the FUS and HTATSF1 proteins can be found in the UniProt database. We used NCBI BLASTP and NCBI VAST Similar Structures [78] to align the two domains. 
With a p-value of 7 × 10⁻⁷, 41 percent identity, and 6 percent gaps, both tools found significant similarity between the sequences of the two domains. There are two amino acid gaps between phenylalanine (F298) in HTATSF1 and (F305) in FUS (Figure A1). HTATSF1 is important because it is involved in numerous processes of protein synthesis, including transcription and neuroectoderm differentiation. By removing introns from ribosomal gene mRNA, it regulates the levels and processing of the ribosome subunits ‘40 S, 60 S, and 80 S’ throughout embryogenesis (the development of the embryo) [79].

Figure A1. Using iCn3D structure alignment viewer [80,81] to view the structure alignment of the RNA binding domain in FUS protein with the RNA binding domain in HTATSF1 protein, highlighting the phenylalanine F305 and F298 in both FUS and HTATSF1, respectively, in yellow color. * marks the fifth amino acid.

Appendix A.2. OR6C65

Olfactory Receptor Family 6 Subfamily C Member 65 (OR6C65) is a receptor classified in the class A rhodopsin-like family of G-protein-coupled receptors (GPCRs). It triggers the neuronal response to the sense of smell in the nose by interacting with odorant molecules [82]. OR6C65 shares the seven-transmembrane (7TM) domain structure of GPCRs, which is conserved among the olfactory receptors. The genetic variant in this protein is a frameshift mutation at the methionine located at position 256 (M256fs). Using the automated search we developed earlier, we found that the genetic variant is located in transmembrane 6 (6TM) according to the UniProt database and is considered part of the putative ligand-binding pocket according to the Conserved Domains database in NCBI [83]. More information was extracted from the UniProt database, such as the short names, full names, the transmembrane name, and whether the variant is a PTM site. The five data lists created were as follows:

data list 1: [‘OR6C65’].

data list 3: [‘M256fs’, ‘M (mutations)’, ‘Methionine (mutations)’].

data list 4: [‘transmembrane 6’, ‘Alpha-helical transmembrane 6’, ‘Olfactory receptor 6C65’].

Data list 2 was empty because the automated approach found no domain name for this region, and data list 5 was empty because the variant was not a PTM site. Again, search terms combined with the logical operator AND were created for use in the PubMed database so as to retrieve every study about this protein and its variant. The following search terms were used: [‘transmembrane 6 AND M256fs’, ‘transmembrane 6 AND M (mutations)’, ‘transmembrane 6 AND Met (mutations)’, ‘transmembrane 6 AND Methionine (mutations)’, ‘Alpha-helical transmembrane 6 AND M256fs’, ‘Alpha-helical transmembrane 6 AND M (mutations)’, ‘Alpha-helical transmembrane 6 AND Met (mutations)’, ‘Alpha-helical transmembrane 6 AND Methionine (mutations)’, ‘Olfactory receptor 6C65 AND M256fs’, ‘Olfactory receptor 6C65 AND M (mutations)’, ‘Olfactory receptor 6C65 AND Met (mutations)’, ‘Olfactory receptor 6C65 AND Methionine (mutations)’, ‘OR6C65 AND M256fs’, ‘OR6C65 AND M (mutations)’, ‘OR6C65 AND Met (mutations)’, ‘OR6C65 AND Methionine (mutations)’, ‘OR6C65’, ‘transmembrane 6’, ‘Alpha-helical transmembrane 6’, ‘Olfactory receptor 6C65’], and the total number of studies found was 10,614.

From the manual investigation of the location of this methionine residue, we found that it lies in transmembrane 6 (6TM) and is considered part of the putative ligand-binding pocket according to the NCBI Conserved Domains database [83]. Although OR6C65 was not among the proteins in the sequence alignment shown on the cd15912 page of the Conserved Domains database, we aligned OR6C1, the first protein that appeared in that sequence alignment, with OR6C65 and found that the two were highly conserved across the 6TM sequence and the other TMs. The deletion of the adenosine nucleotide at position 766 in the DNA sequence disrupted the reading frame, which altered the translated amino acid sequence from M256 onward (M256 frameshift). The sense of smell, like other senses, plays a vital role in our lives. It is involved in the learning process, which can affect our behaviors socially and psychologically. It all begins when odorant molecules enter the human airway and bind to the olfactory receptor neurons in the nose, which opens the cyclic-nucleotide-gated ion channel, allowing Ca²⁺ and Na⁺ to enter the cell and causing an action potential on the cell membrane that stimulates neurotransmitter release through synapses and prompts neuronal growth [84,85,86]. Polarization and depolarization of the neuron membrane by the olfactory receptor are very important for cell-to-cell communication [87]. Disruption of this process may affect neuro-vegetative regions [85,88]. Previous studies have agreed on the disturbance of olfactory function in children and adults with autism [89,90,91].
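The link between nucleotide position 766 and amino acid position 256 can be checked with simple codon arithmetic. The short sketch below assumes that position 766 is counted from the first base of the start codon (1-based coding-sequence numbering), which the text does not state explicitly; the function names are hypothetical.

```python
def codon_number(cds_position: int) -> int:
    """Map a 1-based coding-sequence nucleotide position to its codon number."""
    if cds_position < 1:
        raise ValueError("Position must be 1 or greater")
    return (cds_position - 1) // 3 + 1


def codon_offset(cds_position: int) -> int:
    """Return 0, 1, or 2 for the first, second, or third base of the codon."""
    return (cds_position - 1) % 3


print(codon_number(766))  # 256
print(codon_offset(766))  # 0
```

Under this numbering, nucleotide 766 is the first base of codon 256, which is consistent with an adenosine deletion in the ATG codon for methionine and with the M256 frameshift reported for this variant.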

Appendix A.3. ITIH6

Inter-Alpha-Trypsin Inhibitor Heavy Chain Family Member 6 (ITIH6) functions as a protease inhibitor found in the plasma membrane. It is composed of two main chains. The first chain is the light chain, which functions as a protease inhibitor. The second chain is the heavy chain, which comprises the vault protein inter-alpha-trypsin (VIT) domain and the von Willebrand factor type A (VWFA) domain and functions as a stabilizer for the extracellular matrix. The genetic variant in this protein is a missense mutation at position 385, where serine is substituted with glycine (S385G). According to the automated curation method we developed earlier for the protein databases, the genetic mutation is located in the von Willebrand factor type A (VWFA) domain. The data lists created were as follows:

Data list 1: [‘Inter-alpha inhibitor H5-like ‘, ‘Interinhibitor H5’, ‘ITIH6’].

Data list 2: [‘VWFA’, ‘VWFA’, ‘Undefined’, ‘VIT domain’, ‘The VIT domain can be regarded as the characteristic domain of the inter-alpha-trypsin inhibitor heavy chain (ITIH) family.’].

Data list 3: [‘S385G’, ‘S-to-G (mutations OR substitutions)’, ‘Serine to Glycine (mutations OR substitutions)’].

Data list 4: [‘Inter-alpha-trypsin inhibitor heavy chain H6’].

The search terms used to search for this protein and its genetic variant were combinations of all the terms in these data lists, as follows: [‘VWFA Structure’, ‘Inter-alpha inhibitor H5-like AND VWFA’, ‘Interinhibitor H5 AND VWFA’, ‘ITIH6 AND VWFA’, ‘ITIH6 AND S385G’, ‘ITIH6 AND S-to-G (mutations OR substitutions)’, ‘ITIH6 AND Serine to Glycine (mutations OR substitutions)’, ‘Undefined Structure’, ‘Inter-alpha inhibitor H5-like AND Undefined’, ‘Interinhibitor H5 AND Undefined’, ‘ITIH6 AND Undefined’, ‘VIT domain Structure’, ‘Inter-alpha inhibitor H5-like AND VIT domain’, ‘Interinhibitor H5 AND VIT domain’, ‘ITIH6 AND VIT domain’, ‘The VIT domain can be regarded as the characteristic domain of the inter-alpha-trypsin inhibitor heavy chain (ITIH) family. Structure’, ‘Inter-alpha inhibitor H5-like AND The VIT domain can be regarded as the characteristic domain of the inter-alpha-trypsin inhibitor heavy chain (ITIH) family.’, ‘Interinhibitor H5 AND The VIT domain can be regarded as the characteristic domain of the inter-alpha-trypsin inhibitor heavy chain (ITIH) family.’, ‘ITIH6 AND The VIT domain can be regarded as the characteristic domain of the inter-alpha-trypsin inhibitor heavy chain (ITIH) family.’, ‘Inter-alpha-trypsin inhibitor heavy chain H6 AND VWFA’, ‘Inter-alpha-trypsin inhibitor heavy chain H6 AND Undefined’, ‘Inter-alpha-trypsin inhibitor heavy chain H6 AND VIT domain’, ‘Inter-alpha-trypsin inhibitor heavy chain H6 AND The VIT domain can be regarded as the characteristic domain of the inter-alpha-trypsin inhibitor heavy chain (ITIH) family.’, ‘Inter-alpha-trypsin inhibitor heavy chain H6 AND S385G’, ‘Inter-alpha-trypsin inhibitor heavy chain H6 AND S-to-G (mutations OR substitutions)’, ‘Inter-alpha-trypsin inhibitor heavy chain H6 AND Serine to Glycine (mutations OR substitutions)’, ‘Inter-alpha inhibitor H5-like AND S385G’, ‘Inter-alpha inhibitor H5-like AND S-to-G (mutations OR substitutions)’, ‘Inter-alpha inhibitor H5-like AND Serine to Glycine (mutations OR substitutions)’, ‘Interinhibitor H5 AND S385G’, ‘Interinhibitor H5 AND S-to-G (mutations OR substitutions)’, ‘Interinhibitor H5 AND Serine to Glycine (mutations OR substitutions)’, ‘Inter-alpha inhibitor H5-like ‘, ‘Interinhibitor H5’, ‘ITIH6’, ‘VWFA’, ‘Undefined’, ‘VIT domain’, ‘The VIT domain can be regarded as the characteristic domain of the inter-alpha-trypsin inhibitor heavy chain (ITIH) family.’, ‘Inter-alpha-trypsin inhibitor heavy chain H6’]. The total number of studies found about this protein and its genetic variant was 7406.
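The way the data lists combine into search terms can be sketched in a few lines of Python. This is a minimal illustration inferred from the listing above, not the authors' implementation: the function names are hypothetical, whitespace is stripped and duplicates are dropped as an assumption, and the terms come out in a different order from the listing.

```python
def unique(items):
    """Strip whitespace and drop duplicates, keeping first-seen order."""
    seen = []
    for item in items:
        item = item.strip()
        if item and item not in seen:
            seen.append(item)
    return seen


def build_search_terms(names, domains, variants):
    """Combine protein names, domain terms, and variant terms into queries.

    names    -- data lists 1 and 4 (protein and gene names)
    domains  -- data list 2 (domain annotations; may be empty)
    variants -- data list 3 (variant notations)
    """
    names, domains, variants = unique(names), unique(domains), unique(variants)
    terms = [f"{d} Structure" for d in domains]
    terms += [f"{n} AND {d}" for n in names for d in domains]
    terms += [f"{n} AND {v}" for n in names for v in variants]
    terms += names + domains  # the bare terms are searched as well
    return unique(terms)


VIT_NOTE = ("The VIT domain can be regarded as the characteristic domain of "
            "the inter-alpha-trypsin inhibitor heavy chain (ITIH) family.")
itih6_terms = build_search_terms(
    names=["Inter-alpha inhibitor H5-like ", "Interinhibitor H5", "ITIH6",
           "Inter-alpha-trypsin inhibitor heavy chain H6"],
    domains=["VWFA", "VWFA", "Undefined", "VIT domain", VIT_NOTE],
    variants=["S385G", "S-to-G (mutations OR substitutions)",
              "Serine to Glycine (mutations OR substitutions)"],
)
print(len(itih6_terms))  # 40
```

For the ITIH6 lists this produces 40 distinct terms, the same number of entries as the listing above. When the domain list is empty, only the name-and-variant combinations and the bare names remain.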

The manual curation clarified that the VWFA domain is involved not only in stabilizing the extracellular matrix but also in other cellular functions such as cell adhesion and migration, hemostasis, and signal transduction [92]. VWFA is a domain consisting of alpha-helices and beta-sheets. It also contains a metal-ion-dependent adhesion site (MIDAS) that binds different metal ions. The ligand-binding (ligand-recognition) site has been identified previously in many proteins containing the VWFA domain; it lies between two polypeptide segments that form loops, one between beta-sheet 1 and alpha-helix 1 and the other between beta-sheet 4 and alpha-helix 5 [93]. The VWFA domain is conserved in all proteins that contain it, and the S385G mutation is located in the loop connected to beta-sheet 4. If the mutation is predicted to be harmful, it would disrupt the ligand-binding site and inhibit hyaluronic acid binding [94], as well as destabilize the extracellular matrix [95,96,97]. As a result, it would cause inflammation and dysfunction in neuronal cells, resulting in neurodegenerative disorders such as autism [98,99]. ITIH binding is important in stabilizing the extracellular matrix through the perineuronal net. The perineuronal net is one kind of extracellular matrix in the brain and spinal cord. During development of the neuronal network, perineuronal net formation is the last step and supports the structure of the neurons. Chondroitin sulfate proteoglycans, hyaluronic acid, and its glycan-binding protein (ITIH) are the major components of perineuronal nets, in which chondroitin sulfate proteoglycans bind to hyaluronic acid and ITIH stabilizes the binding between the two molecules. Mutation in ITIH leads to the loss of perineuronal net development and chondroitin sulfate proteoglycan localization [100].

Appendix A.4. DDX26B

Integrator Complex Subunit 6 Like (INTS6L), also known as DDX26B, is an RNA helicase classified as a DEAD (Asp-Glu-Ala-Asp) box protein. It is part of a family of 14 integrator complex proteins that associates with the C-terminus of RNA polymerase II through S2 and S7 phosphorylation and takes part in many RNA-related biological processes such as ribosome arrangement, RNA transcription, splicing, and translation [101,102,103]. The genetic variant reported for this protein is at position 435, where glutamic acid is replaced with valine (E435V). According to the automated curation, the variant is not located in any of the functional locations in the protein. The data lists created are as follows:

Data list 1: [‘Integrator complex subunit 6-like’, ‘DDX26B’, ‘Integrator complex subunit 6’, ‘DDX26B’, ‘INTS6L’].

Data list 3: [‘E435V’, ‘E-to-V (mutations OR substitutions)’, ‘Glutamate to Valine (mutations OR substitutions)’].

Data list 4: [‘Integrator complex subunit 6-like’].
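The three entries of Data list 3 follow a regular pattern that can be derived from the one-letter variant notation. The sketch below is an illustration only: the function name and the amino acid name table are assumptions (the names for S, G, E, and V follow the wording used in the lists above, and the remaining names are filled in by analogy).

```python
import re

# One-letter code -> name used in the search phrase (assumed table).
AMINO_ACIDS = {
    "A": "Alanine", "R": "Arginine", "N": "Asparagine", "D": "Aspartate",
    "C": "Cysteine", "Q": "Glutamine", "E": "Glutamate", "G": "Glycine",
    "H": "Histidine", "I": "Isoleucine", "L": "Leucine", "K": "Lysine",
    "M": "Methionine", "F": "Phenylalanine", "P": "Proline", "S": "Serine",
    "T": "Threonine", "W": "Tryptophan", "Y": "Tyrosine", "V": "Valine",
}


def variant_synonyms(variant):
    """Expand a missense variant such as 'E435V' into three search phrases."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", variant)
    if match is None:
        raise ValueError(f"not a one-letter missense notation: {variant!r}")
    ref, _position, alt = match.groups()
    if ref not in AMINO_ACIDS or alt not in AMINO_ACIDS:
        raise ValueError(f"unknown amino acid code in {variant!r}")
    suffix = "(mutations OR substitutions)"
    return [
        variant,
        f"{ref}-to-{alt} {suffix}",
        f"{AMINO_ACIDS[ref]} to {AMINO_ACIDS[alt]} {suffix}",
    ]


print(variant_synonyms("E435V"))
```

Called with 'E435V', the function returns the same three phrases as Data list 3 above; called with 'S385G', it returns the corresponding list for ITIH6.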

The search terms created from these data lists are as follows: [‘Integrator complex subunit 6-like AND E435V’, ‘Integrator complex subunit 6-like AND E-to-V (mutations OR substitutions)’, ‘Integrator complex subunit 6-like AND Glutamate to Valine (mutations OR substitutions)’, ‘DDX26B AND E435V’, ‘DDX26B AND E-to-V (mutations OR substitutions)’, ‘DDX26B AND Glutamate to Valine (mutations OR substitutions)’, ‘Integrator complex subunit 6 AND E435V’, ‘Integrator complex subunit 6 AND E-to-V (mutations OR substitutions)’, ‘Integrator complex subunit 6 AND Glutamate to Valine (mutations OR substitutions)’, ‘INTS6L AND E435V’, ‘INTS6L AND E-to-V (mutations OR substitutions)’, ‘INTS6L AND Glutamate to Valine (mutations OR substitutions)’, ‘Integrator complex subunit 6-like’, ‘DDX26B’, ‘Integrator complex subunit 6’, ‘INTS6L’]. The total number of studies found about this protein was 975.

From our manual curation, there are differences in structure and function among the integrator complex proteins: the INTS6L protein has a von Willebrand factor type A (VWFA) domain at the N-terminus, positioned between aa 3 and aa 227 according to the UniProt database. There is also a C-terminal domain between aa 777 and aa 838, according to the Conserved Domains database [104], which serves as a protein–protein binding motif that binds INTS3 in response to DNA damage [105]. The DEAD-box is located between aa 597 and aa 600; it functions as an RNA helicase and is involved in RNA metabolism, where it interacts with RNA polymerase II and processes the 3′ end of snRNA [106], although some studies were unable to observe the helicase activity in vitro [105] and in vivo [107]. The importance of INTS6 and INTS6L in dynein binding to the nuclear membrane for cell division and development, as well as in chromosome arrangement during oocyte maturation [106], has been reported. During embryogenesis in zebrafish, another essential role of INTS6 and INTS6L is restricting dorsal cell growth, which affects the early development of the nervous system through the Wnt and BMP signaling pathways [108]. Also, the expression of INTS6 has a key role in the differentiation of adipose tissue during maturation in 3T3-L1 cells [109]. The mutation reported for this protein by Al-Mubarak et al. [68] is at position 435 in the polypeptide sequence, where glutamic acid is replaced with valine (E435V). There is not enough evidence to confirm the effect of this variant on INTS6L function. INTS6 and its paralog INTS6L are essential in stabilizing the integrator complex structure, which supports its function in binding to RNA polymerase II and targeting specific genes [110]. This alteration could also affect the expression of WIF-1, because the upregulation of INTS6 increases the expression of WIF-1 and inhibits the Wnt/β-catenin pathway in hepatocytes [111] and prostate cancer [107]. The Wnt/β-catenin pathway has previously been implicated in autism. In addition, the overlap between cancer and autism risk genes has been confirmed [112]; this suggests that the downregulation of INTS6 decreases the expression of WIF-1 and promotes the accumulation of β-catenin, which would cause the autism phenotype.
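The statement that E435V lies outside every annotated functional location can be checked against the residue ranges quoted above. The following is a minimal sketch under the assumption that the three ranges given in the text are the only annotated regions; the `Region` type and the function name are hypothetical.

```python
from typing import NamedTuple


class Region(NamedTuple):
    name: str
    start: int  # first residue, 1-based, inclusive
    end: int    # last residue, 1-based, inclusive


# Residue ranges as quoted in the text for INTS6L.
INTS6L_REGIONS = [
    Region("VWFA domain", 3, 227),
    Region("DEAD-box", 597, 600),
    Region("C-terminal domain", 777, 838),
]


def regions_at(position, regions):
    """Return the names of all annotated regions that contain the residue."""
    return [r.name for r in regions if r.start <= position <= r.end]


print(regions_at(435, INTS6L_REGIONS))  # []
print(regions_at(598, INTS6L_REGIONS))  # ['DEAD-box']
```

An empty result for residue 435 corresponds to the case where no domain list (Data list 2) is produced for the variant.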

References

Goh, G.; Choi, M. Application of whole exome sequencing to identify disease-causing variants in inherited human diseases. Genom. Inform. 2012, 10, 214–219. [Google Scholar] [CrossRef] [PubMed]
Kereszturi, É. Diversity and Classification of Genetic Variations in Autism Spectrum Disorder. Int. J. Mol. Sci. 2023, 24, 16768. [Google Scholar] [CrossRef]
Le Guen, Y.; Belloy, M.E.; Grenier-Boley, B.; de Rojas, I.; Castillo-Morales, A.; Jansen, I.; Nicolas, A.; Bellenguez, C.; Dalmasso, C.; Küçükali, F.; et al. Association of Rare APOE Missense Variants V236E and R251G With Risk of Alzheimer Disease. JAMA Neurol. 2022, 79, 652–663. [Google Scholar] [CrossRef]
Feliciano, P.; Zhou, X.; Astrovskaya, I.; Turner, T.N.; Wang, T.; Brueggeman, L.; Barnard, R.; Hsieh, A.; Snyder, L.G.; Muzny, D.M.; et al. Exome sequencing of 457 autism families recruited online provides evidence for autism risk genes. NPJ Genom. Med. 2019, 4, 19. [Google Scholar] [CrossRef] [PubMed]
Husson, T.; Lecoquierre, F.; Cassinari, K.; Charbonnier, C.; Quenez, O.; Goldenberg, A.; Guerrot, A.M.; Richard, A.C.; Drouin-Garraud, V.; Brehin, A.C.; et al. Rare genetic susceptibility variants assessment in autism spectrum disorder: Detection rate and practical use. Transl. Psychiatry 2020, 10, 77. [Google Scholar] [CrossRef] [PubMed]
Zhao, B.; Jiang, Q.; Lin, J.; Wei, Q.; Li, C.; Hou, Y.; Cao, B.; Zhang, L.; Ou, R.; Liu, K.; et al. TBK1 variants in Chinese patients with amyotrophic lateral sclerosis: Genetic analysis and clinical features. Eur. J. Neurol. 2023, 30, 3079–3089. [Google Scholar] [CrossRef]
Vihinen, M. Functional effects of protein variants. Biochimie 2021, 180, 104–120. [Google Scholar] [CrossRef]
Milacic, M.; Beavers, D.; Conley, P.; Gong, C.; Gillespie, M.; Griss, J.; Haw, R.; Jassal, B.; Matthews, L.; May, B.; et al. The Reactome Pathway Knowledgebase 2024. Nucleic Acids Res. 2024, 52, D672–D678. [Google Scholar] [CrossRef]
Liu, Z.; Qian, W.; Cai, W.; Song, W.; Wang, W.; Maharjan, D.T.; Cheng, W.; Chen, J.; Wang, H.; Xu, D.; et al. Inferring the Effects of Protein Variants on Protein-Protein Interactions with Interpretable Transformer Representations. Research 2023, 6, 0219. [Google Scholar] [CrossRef]
Ali, S.; Ali, U.; Qamar, A.; Zafar, I.; Yaqoob, M.; Ain, Q.U.; Rashid, S.; Sharma, R.; Nafidi, H.A.; Bin Jardan, Y.A.; et al. Predicting the effects of rare genetic variants on oncogenic signaling pathways: A computational analysis of HRAS protein function. Front. Chem. 2023, 11, 1173624. [Google Scholar] [CrossRef]
Sun, Y.V. Integration of biological networks and pathways with genetic association studies. Hum. Genet. 2012, 131, 1677–1686. [Google Scholar] [CrossRef] [PubMed]
Ahmed, F.; Samantasinghar, A.; Soomro, A.M.; Kim, S.; Choi, K.H. A systematic review of computational approaches to understand cancer biology for informed drug repurposing. J. Biomed. Inform. 2023, 142, 104373. [Google Scholar] [CrossRef] [PubMed]
Nezamuldeen, L.; Jafri, M.S. Protein–Protein Interaction Network Extraction Using Text Mining Methods Adds Insight into Autism Spectrum Disorder. Biology 2023, 12, 1344. [Google Scholar] [CrossRef] [PubMed]
Sherry, S.T.; Ward, M.-H.; Kholodov, M.; Baker, J.; Phan, L.; Smigielski, E.M.; Sirotkin, K. dbSNP: The NCBI database of genetic variation. Nucleic Acids Res. 2001, 29, 308–311. [Google Scholar] [CrossRef] [PubMed]
Allot, A.; Peng, Y.; Wei, C.-H.; Lee, K.; Phan, L.; Lu, Z. LitVar: A semantic search engine for linking genomic variant data in PubMed and PMC. Nucleic Acids Res. 2018, 46, W530–W536. [Google Scholar] [CrossRef] [PubMed]
Vaser, R.; Adusumalli, S.; Leng, S.N.; Sikic, M.; Ng, P.C. SIFT missense predictions for genomes. Nat. Protoc. 2016, 11, 1. [Google Scholar] [CrossRef] [PubMed]
Choi, Y.; Sims, G.E.; Murphy, S.; Miller, J.R.; Chan, A.P. Predicting the functional effect of amino acid substitutions and indels. PLoS ONE 2012, 7, e46688. [Google Scholar] [CrossRef] [PubMed]
Adzhubei, I.A.; Schmidt, S.; Peshkin, L.; Ramensky, V.E.; Gerasimova, A.; Bork, P.; Kondrashov, A.S.; Sunyaev, S.R. A method and server for predicting damaging missense mutations. Nat. Methods 2010, 7, 248–249. [Google Scholar] [CrossRef]
Cheng, J.; Novati, G.; Pan, J.; Bycroft, C.; Žemgulytė, A.; Applebaum, T.; Pritzel, A.; Wong, L.H.; Zielinski, M.; Sargeant, T.; et al. Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science 2023, 381, eadg7492. [Google Scholar] [CrossRef]
Bendl, J.; Stourac, J.; Salanda, O.; Pavelka, A.; Wieben, E.D.; Zendulka, J.; Brezovsky, J.; Damborsky, J. PredictSNP: Robust and Accurate Consensus Classifier for Prediction of Disease-Related Mutations. PLoS Comput. Biol. 2014, 10, e1003440. [Google Scholar] [CrossRef]
McCoy, M.D.; Hamre, J., 3rd; Klimov, D.K.; Jafri, M.S. Predicting Genetic Variation Severity Using Machine Learning to Interpret Molecular Simulations. Biophys. J. 2021, 120, 189–204. [Google Scholar] [CrossRef] [PubMed]
Hamre, J.R., 3rd; Klimov, D.K.; McCoy, M.D.; Jafri, M.S. Machine learning-based prediction of drug and ligand binding in BCL-2 variants through molecular dynamics. Comput. Biol. Med. 2022, 140, 105060. [Google Scholar] [CrossRef] [PubMed]
Schwarz, J.M.; Hombach, D.; Köhler, S.; Cooper, D.N.; Schuelke, M.; Seelow, D. RegulationSpotter: Annotation and interpretation of extratranscriptic DNA variants. Nucleic Acids Res. 2019, 47, W106–W113. [Google Scholar] [CrossRef] [PubMed]
Parthiban, V.; Gromiha, M.M.; Schomburg, D. CUPSAT: Prediction of protein stability upon point mutations. Nucleic Acids Res. 2006, 34, W239–W242. [Google Scholar] [CrossRef]
Pejaver, V.; Urresti, J.; Lugo-Martinez, J.; Pagel, K.A.; Lin, G.N.; Nam, H.-J.; Mort, M.; Cooper, D.N.; Sebat, J.; Iakoucheva, L.M. MutPred2: Inferring the molecular and phenotypic impact of amino acid variants. bioRxiv 2017, 134981. [Google Scholar] [CrossRef] [PubMed]
López-Ferrando, V.; Gazzo, A.; De La Cruz, X.; Orozco, M.; Gelpí, J.L. PMut: A web-based tool for the annotation of pathological variants on proteins, 2017 update. Nucleic Acids Res. 2017, 45, W222–W228. [Google Scholar] [CrossRef] [PubMed]
Masso, M.; Vaisman, I.I. AUTO-MUTE: Web-based tools for predicting stability changes in proteins due to single amino acid replacements. Protein Eng. Des. Sel. 2010, 23, 683–687. [Google Scholar] [CrossRef] [PubMed]
Schymkowitz, J.; Borg, J.; Stricher, F.; Nys, R.; Rousseau, F.; Serrano, L. The FoldX web server: An online force field. Nucleic Acids Res. 2005, 33, W382–W388. [Google Scholar] [CrossRef]
Benedix, A.; Becker, C.M.; de Groot, B.L.; Caflisch, A.; Böckmann, R.A. Predicting free energy changes using structural ensembles. Nat. Methods 2009, 6, 3. [Google Scholar] [CrossRef]
Song, Y.; Di Maio, F.; Wang, R.Y.-R.; Kim, D.; Miles, C.; Brunette, T.; Thompson, J.; Baker, D. High-resolution comparative modeling with RosettaCM. Structure 2013, 21, 1735–1742. [Google Scholar] [CrossRef]
Capriotti, E.; Fariselli, P.; Casadio, R. I-Mutant2.0: Predicting stability changes upon mutation from the protein sequence or structure. Nucleic Acids Res. 2005, 33, W306–W310. [Google Scholar] [CrossRef] [PubMed]
Potapov, V.; Cohen, M.; Schreiber, G. Assessing computational methods for predicting protein stability upon mutation: Good on average but not in the details. Protein Eng. Des. Sel. 2009, 22, 553–560. [Google Scholar] [CrossRef] [PubMed]
Palli, R.; Palshikar, M.G.; Thakar, J. Executable pathway analysis using ensemble discrete-state modeling for large-scale data. PLoS Comput. Biol. 2019, 15, e1007317. [Google Scholar] [CrossRef] [PubMed]
Kanehisa, M.; Furumichi, M.; Sato, Y.; Kawashima, M.; Ishiguro-Watanabe, M. KEGG for taxonomy-based analysis of pathways and genomes. Nucleic Acids Res. 2023, 51, D587–D592. [Google Scholar] [CrossRef] [PubMed]
Oughtred, R.; Rust, J.; Chang, C.; Breitkreutz, B.J.; Stark, C.; Willems, A.; Boucher, L.; Leung, G.; Kolas, N.; Zhang, F.; et al. The BioGRID database: A comprehensive biomedical resource of curated protein, genetic, and chemical interactions. Protein Sci. 2021, 30, 187–200. [Google Scholar] [CrossRef] [PubMed]
Cerami, E.G.; Gross, B.E.; Demir, E.; Rodchenkov, I.; Babur, Ö.; Anwar, N.; Schultz, N.; Bader, G.D.; Sander, C. Pathway Commons, a web resource for biological pathway data. Nucleic Acids Res. 2010, 39, D685–D690. [Google Scholar] [CrossRef] [PubMed]
Demir, E.; Cary, M.P.; Paley, S.; Fukuda, K.; Lemer, C.; Vastrik, I.; Wu, G.; D’eustachio, P.; Schaefer, C.; Luciano, J. The BioPAX community standard for pathway data sharing. Nat. Biotechnol. 2010, 28, 935. [Google Scholar] [CrossRef] [PubMed]
Shannon, P.; Markiel, A.; Ozier, O.; Baliga, N.S.; Wang, J.T.; Ramage, D.; Amin, N.; Schwikowski, B.; Ideker, T. Cytoscape: A software environment for integrated models of biomolecular interaction networks. Genome Res. 2003, 13, 2498–2504. [Google Scholar] [CrossRef]
Schwab, J.D.; Kühlwein, S.D.; Ikonomi, N.; Kühl, M.; Kestler, H.A. Concepts in Boolean network modeling: What do they all mean? Comput. Struct. Biotechnol. J. 2020, 18, 571–582. [Google Scholar] [CrossRef]
Veliz-Cuba, A.; Aguilar, B.; Hinkelmann, F.; Laubenbacher, R. Steady state analysis of Boolean molecular network models via model reduction and computational algebra. BMC Bioinform. 2014, 15, 221. [Google Scholar] [CrossRef]
Irurzun-Arana, I.; Pastor, J.M.; Trocóniz, I.F.; Gómez-Mantilla, J.D. Advanced Boolean modeling of biological networks applied to systems pharmacology. Bioinformatics 2017, 33, 1040–1048. [Google Scholar] [CrossRef] [PubMed]
Nezamuldeen, L.; Jafri, M.S. Boolean Modeling of Biological Network Applied to Protein-Protein Interaction Network of Autism Patients. Biology 2024, 13, 606. [Google Scholar] [CrossRef]
Przybyła, P.; Shardlow, M.; Aubin, S.; Bossy, R.; Eckart de Castilho, R.; Piperidis, S.; McNaught, J.; Ananiadou, S. Text mining resources for the life sciences. Database 2016, 2016, baw145. [Google Scholar] [CrossRef] [PubMed]
Verspoor, K.M.; Cohn, J.D.; Ravikumar, K.E.; Wall, M.E. Text mining improves prediction of protein functional sites. PLoS ONE 2012, 7, e32171. [Google Scholar] [CrossRef] [PubMed]
Samandari Bahraseman, M.R.; Khorsand, B.; Esmaeilzadeh-Salestani, K.; Sarhadi, S.; Hatami, N.; Khaleghdoust, B.; Loit, E. The use of integrated text mining and protein-protein interaction approach to evaluate the effects of combined chemotherapeutic and chemopreventive agents in cancer therapy. PLoS ONE 2022, 17, e0276458. [Google Scholar] [CrossRef] [PubMed]
Wei, C.H.; Harris, B.R.; Kao, H.Y.; Lu, Z. tmVar: A text mining approach for extracting sequence variants in biomedical literature. Bioinformatics 2013, 29, 1433–1439. [Google Scholar] [CrossRef]
Alipanahi, B.; Delong, A.; Weirauch, M.T.; Frey, B.J. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 2015, 33, 831–838. [Google Scholar] [CrossRef] [PubMed]
Salekin, S.; Zhang, J.M.; Huang, Y. A deep learning model for predicting transcription factor binding location at single nucleotide resolution. In Proceedings of the 2017 IEEE EMBS International Conference on Biomedical & Health Informatics (BHI), Orlando, FL, USA, 16–19 February 2017; pp. 57–60. [Google Scholar]
Zhou, J.; Troyanskaya, O.G. Predicting effects of noncoding variants with deep learning-based sequence model. Nat. Methods 2015, 12, 931–934. [Google Scholar] [CrossRef] [PubMed]
Gupta, A.; Rush, A.M. Dilated convolutions for modeling long-distance genomic dependencies. arXiv 2017, arXiv:1710.01278. [Google Scholar]
Quang, D.; Xie, X. DanQ: A hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences. Nucleic Acids Res. 2016, 44, e107. [Google Scholar] [CrossRef]
Yang, B.; Liu, F.; Ren, C.; Ouyang, Z.; Xie, Z.; Bo, X.; Shu, W. BiRen: Predicting enhancers with a deep-learning-based model using the DNA sequence alone. Bioinformatics 2017, 33, 1930–1936. [Google Scholar] [CrossRef] [PubMed]
Shen, Z.; Bao, W.; Huang, D.S. Recurrent Neural Network for Predicting Transcription Factor Binding Sites. Sci. Rep. 2018, 8, 15270. [Google Scholar] [CrossRef]
Pan, X.; Rijnbeek, P.; Yan, J.; Shen, H.B. Prediction of RNA-protein sequence and structure binding preferences using deep convolutional and recurrent neural networks. BMC Genom. 2018, 19, 511. [Google Scholar] [CrossRef]
He, Y.; Shen, Z.; Zhang, Q.; Wang, S.; Huang, D.S. A survey on deep learning in DNA/RNA motif mining. Brief. Bioinform. 2021, 22, bbaa229. [Google Scholar] [CrossRef]
Kaddour, J.; Harris, J.; Mozes, M.; Bradley, H.; Raileanu, R.; McHardy, R. Challenges and applications of large language models. arXiv 2023, arXiv:2307.10169. [Google Scholar]
Lee, J.; Yoon, W.; Kim, S.; Kim, D.; So, C.H.; Kang, J. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2020, 36, 1234–1240. [Google Scholar] [CrossRef]
Luo, R.; Sun, L.; Xia, Y.; Qin, T.; Zhang, S.; Poon, H.; Liu, T.-Y. BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining. Brief. Bioinform. 2022, 23, bbac409. [Google Scholar] [CrossRef] [PubMed]
Szklarczyk, D.; Kirsch, R.; Koutrouli, M.; Nastou, K.; Mehryary, F.; Hachilif, R.; Gable, A.L.; Fang, T.; Doncheva, N.T.; Pyysalo, S.; et al. The STRING database in 2023: Protein-protein association networks and functional enrichment analyses for any sequenced genome of interest. Nucleic Acids Res. 2023, 51, D638–D646. [Google Scholar] [CrossRef] [PubMed]
Zagirova, D.; Pushkov, S.; Leung, G.H.D.; Liu, B.H.M.; Urban, A.; Sidorenko, D.; Kalashnikov, A.; Kozlova, E.; Naumov, V.; Pun, F.W.; et al. Biomedical generative pre-trained based transformer language model for age-related disease target discovery. Aging 2023, 15, 9293–9309. [Google Scholar] [CrossRef]
Huang, L.; Lin, J.; Li, X.; Song, L.; Zheng, Z.; Wong, K.C. EGFI: Drug-drug interaction extraction and generation with fusion of enriched entity and sentence information. Brief. Bioinform. 2022, 23, bbab451. [Google Scholar] [CrossRef] [PubMed]
Karkera, N.; Acharya, S.; Palaniappan, S.K. Leveraging pre-trained language models for mining microbiome-disease relationships. BMC Bioinform. 2023, 24, 290. [Google Scholar] [CrossRef] [PubMed]
Das Baksi, K.; Pokhrel, V.; Pudavar, A.E.; Mande, S.S.; Kuntal, B.K. BactInt: A domain driven transfer learning approach for extracting inter-bacterial associations from biomedical text. Comput. Biol. Chem. 2024, 109, 108012. [Google Scholar] [CrossRef] [PubMed]
Philippidis, A. Nvidia Looks to Genentech for Its Next Leap in AI Drug Discovery: Roche subsidiary becomes newest biopharma partner for Silicon Valley giant as it grows life sciences footprint. GEN Edge 2023, 5, 828–833. [Google Scholar] [CrossRef]
Sevgen, E.; Moller, J.; Lange, A.; Parker, J.; Quigley, S.; Mayer, J.; Srivastava, P.; Gayatri, S.; Hosfield, D.; Korshunova, M. ProT-VAE: Protein transformer variational autoencoder for functional protein design. bioRxiv 2023. [Google Scholar] [CrossRef]
Wong, F.; de la Fuente-Nunez, C.; Collins, J.J. Leveraging artificial intelligence in the fight against infectious diseases. Science 2023, 381, 164–170. [Google Scholar] [CrossRef]
Roberts, J.B.; Nava, A.A.; Pearson, A.N.; Incha, M.R.; Valencia, L.E.; Ma, M.; Rao, A.; Keasling, J.D. Foldy: A web application for interactive protein structure analysis. bioRxiv 2023. [Google Scholar] [CrossRef]
Al-Mubarak, B.; Abouelhoda, M.; Omar, A.; AlDhalaan, H.; Aldosari, M.; Nester, M.; Alshamrani, H.A.; El-Kalioby, M.; Goljan, E.; Albar, R. Whole exome sequencing reveals inherited and de novo variants in autism spectrum disorder: A trio study from Saudi families. Sci. Rep. 2017, 7, 5679. [Google Scholar] [CrossRef]
Coudert, E.; Gehant, S.; de Castro, E.; Pozzato, M.; Baratin, D.; Neto, T.; Sigrist, C.J.A.; Redaschi, N.; Bridge, A.; Consortium, U. Annotation of biologically relevant ligands in UniProtKB using ChEBI. Bioinformatics 2023, 39, btac793. [Google Scholar] [CrossRef]
Requests: HTTP for Humans™. Available online: https://requests.readthedocs.io/en/latest/ (accessed on 1 July 2024).
Richardson, L. Beautiful Soup Documentation; April, 2007. Available online: https://tedboy.github.io/bs4_doc/ (accessed on 1 July 2024).
Huang, H.; Arighi, C.N.; Ross, K.E.; Ren, J.; Li, G.; Chen, S.C.; Wang, Q.; Cowart, J.; Vijay-Shanker, K.; Wu, C.H. iPTMnet: An integrated resource for protein post-translational modification network discovery. Nucleic Acids Res. 2018, 46, D542–D550. [Google Scholar] [CrossRef]
Ahmad, R.M.; Ali, B.R.; Al-Jasmi, F.; Sinnott, R.O.; Al Dhaheri, N.; Mohamad, M.S. A review of genetic variant databases and machine learning tools for predicting the pathogenicity of breast cancer. Brief. Bioinform. 2023, 25, bbad479. [Google Scholar] [CrossRef]
Berger, S.M.; Appelbaum, P.S.; Siegel, K.; Wynn, J.; Saami, A.M.; Brokamp, E.; O’Connor, B.C.; Hamid, R.; Martin, D.M.; Chung, W.K. Challenges of variant reinterpretation: Opinions of stakeholders and need for guidelines. Genet. Med. 2022, 24, 1878–1887. [Google Scholar] [CrossRef]
Garcia, F.A.O.; de Andrade, E.S.; Palmero, E.I. Insights on variant analysis. Front. Genet. 2022, 13, 1010327. [Google Scholar] [CrossRef]
Daigle, J.G.; Lanson, N.A., Jr.; Smith, R.B.; Casci, I.; Maltare, A.; Monaghan, J.; Nichols, C.D.; Kryndushkin, D.; Shewmaker, F.; Pandey, U.B. RNA-binding ability of FUS regulates neurodegeneration, cytoplasmic mislocalization and incorporation into stress granules associated with FUS carrying ALS-linked mutations. Hum. Mol. Genet. 2012, 22, 1193–1205. [Google Scholar] [CrossRef] [PubMed]
Cléry, A.; Allain, F.H.T. From structure to function of RNA binding domains. In Madame Curie Bioscience Database; Landes Bioscience: Austin, TX, USA, 2000–2013. [Google Scholar]
Madej, T.; Lanczycki, C.J.; Zhang, D.; Thiessen, P.A.; Geer, R.C.; Marchler-Bauer, A.; Bryant, S.H. MMDB and VAST+: Tracking structural similarities between macromolecular complexes. Nucleic Acids Res. 2014, 42, D297–D303. [Google Scholar] [CrossRef] [PubMed]
Corsini, N.S.; Peer, A.M.; Moeseneder, P.; Roiuk, M.; Burkard, T.R.; Theussl, H.C.; Moll, I.; Knoblich, J.A. Coordinated Control of mRNA and rRNA Processing Controls Embryonic Stem Cell Pluripotency and Differentiation. Cell Stem Cell 2018, 22, 543–558.e512. [Google Scholar] [CrossRef]
Wang, J.; Youkharibache, P.; Zhang, D.; Lanczycki, C.J.; Geer, R.C.; Madej, T.; Phan, L.; Ward, M.; Lu, S.; Marchler, G.H.; et al. iCn3D, a web-based 3D viewer for sharing 1D/2D/3D representations of biomolecular structures. Bioinformatics 2020, 36, 131–135. [Google Scholar] [CrossRef]
Wang, J.; Youkharibache, P.; Marchler-Bauer, A.; Lanczycki, C.; Zhang, D.; Lu, S.; Madej, T.; Marchler, G.H.; Cheng, T.; Chong, L.C.; et al. iCn3D: From Web-Based 3D Viewer to Structural Analysis Tool in Batch Mode. Front. Mol. Biosci. 2022, 9, 831740. [Google Scholar] [CrossRef] [PubMed]
Olender, T.; Lancet, D.; Nebert, D.W. Update on the olfactory receptor (OR) gene superfamily. Hum. Genom. 2008, 3, 87. [Google Scholar] [CrossRef]
Marchler-Bauer, A.; Bo, Y.; Han, L.; He, J.; Lanczycki, C.J.; Lu, S.; Chitsaz, F.; Derbyshire, M.K.; Geer, R.C.; Gonzales, N.R.; et al. CDD/SPARCLE: Functional classification of proteins via subfamily domain architectures. Nucleic Acids Res. 2017, 45, D200–D203. [Google Scholar] [CrossRef]
Ronnett, G.V.; Moon, C. G proteins and olfactory signal transduction. Annu. Rev. Physiol. 2002, 64, 189–222. [Google Scholar] [CrossRef]
Sarafoleanu, C.; Mella, C.; Georgescu, M.; Perederco, C. The importance of the olfactory sense in the human behavior and evolution. J. Med. Life 2009, 2, 196–198. [Google Scholar] [PubMed]
Rinaldi, A. The scent of life. The exquisite complexity of the sense of smell in animals and humans. EMBO Rep. 2007, 8, 629–633. [Google Scholar] [CrossRef] [PubMed]
Hedlund, B.; Masukawa, L.M.; Shepherd, G.M. Excitable properties of olfactory receptor neurons. J. Neurosci. 1987, 7, 2338–2343. [Google Scholar]
Tonacci, A.; Sansone, F.; Pala, A.P.; Centrone, A.; Napoli, F.; Domenici, C.; Conte, R. Effect of feeding on neurovegetative response to olfactory stimuli. In Proceedings of the 2017 E-Health and Bioengineering Conference (EHB), Sinaia, Romania, 22–24 June 2017; pp. 9–12. [Google Scholar]
Ashwin, C.; Chapman, E.; Howells, J.; Rhydderch, D.; Walker, I.; Baron-Cohen, S. Enhanced olfactory sensitivity in autism spectrum conditions. Mol. Autism 2014, 5, 53. [Google Scholar] [CrossRef]
Rozenkrantz, L.; Zachor, D.; Heller, I.; Plotkin, A.; Weissbrod, A.; Snitz, K.; Secundo, L.; Sobel, N. A Mechanistic Link between Olfaction and Autism Spectrum Disorder. Curr. Biol. 2015, 25, 1904–1910. [Google Scholar] [CrossRef]
Wicker, B.; Monfardini, E.; Royet, J.P. Olfactory processing in adults with autism spectrum disorders. Mol. Autism 2016, 7, 4. [Google Scholar] [CrossRef]
Zhuo, L.; Kimata, K. Structure and function of inter-α-trypsin inhibitor heavy chains. Connect. Tissue Res. 2008, 49, 311–320. [Google Scholar] [CrossRef]
Morikis, D.; Lambris, J.D. Structural Biology of the Complement System; CRC Press: Boca Raton, FL, USA, 2005. [Google Scholar]
Whittaker, C.A.; Hynes, R.O. Distribution and evolution of von Willebrand/integrin A domains: Widely dispersed domains with roles in cell adhesion and elsewhere. Mol. Biol. Cell 2002, 13, 3369–3387. [Google Scholar] [CrossRef] [PubMed]
Bost, F.; Diarra-Mehrpour, M.; Martin, J.P. Inter-alpha-trypsin inhibitor proteoglycan family—A group of proteins binding and stabilizing the extracellular matrix. Eur. J. Biochem. 1998, 252, 339–346. [Google Scholar] [CrossRef] [PubMed]
Huang, L.; Yoneda, M.; Kimata, K. A serum-derived hyaluronan-associated protein (SHAP) is the heavy chain of the inter alpha-trypsin inhibitor. J. Biol. Chem. 1993, 268, 26725–26730. [Google Scholar] [CrossRef]
Zhao, M.; Yoneda, M.; Ohashi, Y.; Kurono, S.; Iwata, H.; Ohnuki, Y.; Kimata, K. Evidence for the covalent binding of SHAP, heavy chains of inter-alpha-trypsin inhibitor, to hyaluronan. J. Biol. Chem. 1995, 270, 26657–26663. [Google Scholar] [CrossRef] [PubMed]
Gaudet, A.D.; Popovich, P.G. Extracellular matrix regulation of inflammation in the healthy and injured spinal cord. Exp. Neurol. 2014, 258, 24–34. [Google Scholar] [CrossRef] [PubMed]
Bonneh-Barkay, D.; Wiley, C.A. Brain extracellular matrix in neurodegeneration. Brain Pathol. 2009, 19, 573–585. [Google Scholar] [CrossRef] [PubMed]
Warren, P.M.; Dickens, S.M.; Gigout, S.; Fawcett, J.W.; Kwok, J.C.F. Regulation of CNS Plasticity Through the Extracellular Matrix. In The Oxford Handbook of Developmental Neural Plasticity; Chao, M.V., Ed.; Oxford University Press: New York, NY, USA, 2018. [Google Scholar]
Camargo, A.A.; Nunes, D.N.; Samaia, H.B.; Liu, L.; Collins, V.P.; Simpson, A.J.; Dias-Neto, E. Molecular characterization of DDX26, a human DEAD-box RNA helicase, located on chromosome 7p12. Braz. J. Med. Biol. Res. 2001, 34, 1237–1245. [Google Scholar] [CrossRef] [PubMed]
Baillat, D.; Hakimi, M.A.; Näär, A.M.; Shilatifard, A.; Cooch, N.; Shiekhattar, R. Integrator, a multiprotein mediator of small nuclear RNA processing, associates with the C-terminal repeat of RNA polymerase II. Cell 2005, 123, 265–276. [Google Scholar] [CrossRef] [PubMed]
Baillat, D.; Wagner, E.J. Integrator: Surprisingly diverse functions in gene expression. Trends Biochem. Sci. 2015, 40, 257–264. [Google Scholar] [CrossRef] [PubMed]
Marchler-Bauer, A.; Zheng, C.; Chitsaz, F.; Derbyshire, M.K.; Geer, L.Y.; Geer, R.C.; Gonzales, N.R.; Gwadz, M.; Hurwitz, D.I.; Lanczycki, C.J.; et al. CDD: Conserved domains and protein three-dimensional structure. Nucleic Acids Res. 2013, 41, D348–D352. [Google Scholar] [CrossRef]
Zhang, F.; Ma, T.; Yu, X. A core hSSB1-INTS complex participates in the DNA damage response. J. Cell Sci. 2013, 126, 4850–4855. [Google Scholar] [CrossRef]
Jodoin, J.N.; Sitaram, P.; Albrecht, T.R.; May, S.B.; Shboul, M.; Lee, E.; Reversade, B.; Wagner, E.J.; Lee, L.A. Nuclear-localized Asunder regulates cytoplasmic dynein localization via its role in the integrator complex. Mol. Biol. Cell 2013, 24, 2954–2965. [Google Scholar] [CrossRef] [PubMed]
Chen, J.; Wagner, E.J. snRNA 3′ end formation: The dawn of the Integrator complex. Biochem. Soc. Trans. 2010, 38, 1082–1087. [Google Scholar] [CrossRef]
Kapp, L.D.; Abrams, E.W.; Marlow, F.L.; Mullins, M.C. The integrator complex subunit 6 (Ints6) confines the dorsal organizer in vertebrate embryogenesis. PLoS Genet. 2013, 9, e1003822. [Google Scholar] [CrossRef] [PubMed]
Otani, Y.; Nakatsu, Y.; Sakoda, H.; Fukushima, T.; Fujishiro, M.; Kushiyama, A.; Okubo, H.; Tsuchiya, Y.; Ohno, H.; Takahashi, S.; et al. Integrator complex plays an essential role in adipose differentiation. Biochem. Biophys. Res. Commun. 2013, 434, 197–202. [Google Scholar] [CrossRef] [PubMed]
Skaar, J.R.; Ferris, A.L.; Wu, X.; Saraf, A.; Khanna, K.K.; Florens, L.; Washburn, M.P.; Hughes, S.H.; Pagano, M. The Integrator complex controls the termination of transcription at diverse classes of gene targets. Cell Res. 2015, 25, 288–305. [Google Scholar] [CrossRef] [PubMed]
Lui, K.Y.; Zhao, H.; Qiu, C.; Li, C.; Zhang, Z.; Peng, H.; Fu, R.; Chen, H.A.; Lu, M.Q. Integrator complex subunit 6 (INTS6) inhibits hepatocellular carcinoma growth by Wnt pathway and serve as a prognostic marker. BMC Cancer 2017, 17, 644. [Google Scholar] [CrossRef] [PubMed]
Crawley, J.N.; Heyer, W.D.; LaSalle, J.M. Autism and Cancer Share Risk Genes, Pathways, and Drug Targets. Trends Genet. 2016, 32, 139–146. [Google Scholar] [CrossRef]

Figure 1. A schematic representation of the workflow developed to gather more information about the genetic variant and its effect on protein function.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

