Article

Improved LINE-1 Detection through Pattern Matching by Increasing Probe Length

by Juan O. López *, Javier L. Quiñones and Emanuel D. Martínez
Department of Computer Science, University of Puerto Rico at Arecibo, Arecibo 00612, Puerto Rico
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Biology 2024, 13(4), 236; https://doi.org/10.3390/biology13040236
Submission received: 1 January 2024 / Revised: 22 February 2024 / Accepted: 4 March 2024 / Published: 2 April 2024

Simple Summary

Long Interspersed Element-1 (LINE-1 or L1) is an autonomous transposable element, meaning that its DNA sequences are able to replicate themselves throughout the human genome. This activity may lead to genomic instability and is associated with several different diseases. Additionally, L1s are also capable of replicating other non-autonomous sequences, thereby increasing their disruptive impact. Although there are different tools available that may be used for L1 detection, the heuristics involved affect their accuracy. L1PD (LINE-1 Pattern Detection) uses a novel pattern-matching approach to detect L1s in human genomes, using a fixed set of k-mer probes of length 50 that were generated using the human reference genome GRCh38. This research aims to improve L1PD by using longer probes and testing whether this leads to better results. Additionally, experiments were performed to test the effectiveness of L1PD in detecting L1s in other species, such as dogs, horses, and cows. The results showed that longer probes did improve precision and recall of L1s, not only in humans but in the other species as well.

Abstract

Long Interspersed Element-1 (LINE-1 or L1) is an autonomous transposable element that accounts for 17% of the human genome. Strong correlations between abnormal L1 expression and diseases, particularly cancer, have been documented by numerous studies. L1PD (LINE-1 Pattern Detection) was previously created to detect L1s by using a fixed pre-determined set of 50-mer probes and a pattern-matching algorithm. L1PD uses a novel seed-and-pattern-match strategy as opposed to the well-known seed-and-extend strategy employed by other tools. This study presents an improved version of L1PD and shows that increasing the size of the k-mer probes from 50 to 75 or to 100 yields better results, as evidenced by experiments showing higher precision and recall when compared to the 50-mers. The probe-generation process was updated and the corresponding software is now shared so that users may generate probes for other reference genomes (with certain limitations). Additionally, L1PD was applied to non-human genomes, such as those of dogs, horses, and cows, to further validate the pattern-matching strategy. The improved version of L1PD proves to be an efficient and promising approach for L1 detection.

1. Introduction

1.1. Transposable Elements and LINE-1s

Transposable elements (TEs) or Transposons are DNA sequences that move from one location in the genome to another. These elements, occupying about half of the human genome, play an important role in the evolution of genomes, influencing genetic variation and genomic stability. Due to the disruption they cause in the genome, they are linked to various diseases [1,2,3]. In humans, the Long Interspersed Element-1 (LINE-1 or L1) is the only active autonomous TE, accounting for 17% of the genome with more than 500,000 sequences. L1s are capable of mobilizing themselves as well as other non-autonomous TEs, such as Alu and SVA elements, using a “copy-and-paste” mechanism [4,5].

1.2. Research Areas Involving LINE-1s

One of the areas where L1s have been of high interest is cancer research, since L1s have been associated with varying forms of cancer [6,7]; in fact, more than 1000 articles focusing on L1s and cancer are available in the PubMed archive [8]. The protein encoded by the Open Reading Frame 1 of L1s (ORF1p), specifically, has been considered a biomarker of neoplasia [9]. ORF1p has also been found to be a candidate biomarker in high-grade serous ovarian carcinoma [10]. In short, ORF1p has shown promise as a multicancer biomarker with potential utility for disease detection and monitoring, including ovarian cancer, gastroesophageal cancer, and colorectal cancer [6,11]. Precisely for this reason, there have been attempts to inhibit ORF1p expression and L1s in general [3,12].
L1s have impacted other research areas besides cancer, including the following recent studies:
  • A study on mice by Song et al. showed that L1-induced hearing impairment could actually be reversed by deleting the L1 retrotransposon insertion [13].
  • A study by Tao et al. showed that L1 insertions can occur frequently at CRISPR/Cas9 editing sites [14].
  • A study by Takahashi et al. suggests L1 activation in the cerebellum may cause ataxia [15].
  • A study by Lou et al. suggests L1s may lead to early spontaneous abortion [16].
Hence, the impact of L1s is wide-reaching, which highlights the importance of further research to better understand how their presence and frequency may be used for disease diagnosis and/or prevention.

1.3. L1PD

Because of these adverse effects on health associated with L1s, accurate and efficient L1 detection is important, which is why we developed L1PD (LINE-1 Pattern Detection) [17]. Most L1s are inactive due to rearrangements, point mutations, and truncation [4,5], which is why L1PD focuses on detecting full-length L1s, the ones most likely to retrotranspose at significant rates. There are several commonly used aligners that use the well-known seed-and-extend strategy, such as BWA-MEM [18], Bowtie2 [19], and CUSHAW2 [20], with certain differences in the seeding and extension techniques [21]. L1PD uses what we have termed seed-and-pattern-match, as opposed to seed-and-extend: the heuristics of the extend component are replaced with a pattern-matching algorithm. López et al. [17] discuss the shortcomings of seed-and-extend in the context of L1 detection.
The pattern matching is carried out by using a fixed pre-determined set of k-mer probes, generated based on the L1 database L1Base2 [22]. The probes are seeded into a target human genome and then particular patterns of matches are considered to be L1s if they meet certain criteria. Experiments were carried out with varying values for edit distance, distance threshold, and minimum amount of probes required in a pattern. Results were analyzed to determine which values maximized F1 score, and these values were then set to be the default values of L1PD, although the user is able to specify different values if they wish to favor either precision or recall.
Our current research focuses on improving L1PD’s performance by increasing the probe length and automating and improving the probe-generation process, thereby increasing the usefulness of the software and promoting its use in the scientific community to propel further L1 research. Additionally, we tested the effectiveness of L1PD with genomes of other species that are also available in L1Base2, thereby making it possible to detect L1s within other species as long as certain metadata are available.

2. Materials and Methods

2.1. Probe Generation

The seed-and-pattern-match approach established in L1PD [17] displayed promising results of precision and F1 score when matching the generated 50-mer probes back to the human genome. It was hypothesized that perhaps increasing k-mer length would yield better recall, and this hypothesis was reinforced when it was found that another study by Phan et al. had previously shown a remarkable improvement in recall, with mrFAST and other aligners, when k-mer size was increased from 50 to 75 or 100 [23]. The process of generating the probes will now be explained.
There is a pre-processing step to create a single FASTA file with all of the sequences from which probes are to be generated. First, all of the full-length intact L1s are downloaded from L1Base2 in FASTA format, along with the corresponding Comma-Separated Values (CSV) file containing the metadata for the L1s. The CSV file indicates the positions where ORF1 and ORF2 reside, allowing us to extract them into two separate FASTA files, one with all of the ORF1s and another with all of the ORF2s. These files are processed separately, producing probes from ORF1 and from ORF2 that are later joined (manually) into a single file. Additionally, a reference genome is needed (e.g., GRCh38), since it will be used later to weed out probe candidates that do not meet certain criteria. See Figure 1 for a visualization of this pre-processing step.
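As an illustration only, the ORF extraction in this pre-processing step could be sketched in Python as follows. The CSV column names (`id`, `orf1_start`, `orf1_end`, `orf2_start`, `orf2_end`) and the assumption of 1-based inclusive coordinates relative to each L1 sequence are hypothetical; the actual L1Base2 files and the L1PD scripts may be organized differently.

```python
import csv
import io


def read_fasta(handle):
    """Parse FASTA text into a dict mapping record id -> sequence."""
    records, name, chunks = {}, None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)
            name, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records


def extract_orfs(fasta_handle, csv_handle):
    """Slice ORF1 and ORF2 out of each L1 using the CSV coordinates.

    Hypothetical columns: id, orf1_start, orf1_end, orf2_start, orf2_end
    (1-based, inclusive, relative to the L1 sequence).
    """
    l1s = read_fasta(fasta_handle)
    orf1, orf2 = {}, {}
    for row in csv.DictReader(csv_handle):
        seq = l1s.get(row["id"])
        if seq is None:
            continue  # metadata row without a matching sequence
        orf1[row["id"]] = seq[int(row["orf1_start"]) - 1:int(row["orf1_end"])]
        orf2[row["id"]] = seq[int(row["orf2_start"]) - 1:int(row["orf2_end"])]
    return orf1, orf2


# Toy input, not real L1 data.
fasta_text = ">L1_a\nAAACCCGGGTTT\n>L1_b\nTTTGGGCCCAAA\n"
csv_text = (
    "id,orf1_start,orf1_end,orf2_start,orf2_end\n"
    "L1_a,1,6,7,12\n"
    "L1_b,4,9,10,12\n"
)
orf1, orf2 = extract_orfs(io.StringIO(fasta_text), io.StringIO(csv_text))
```

Each resulting dictionary would then be written to its own FASTA file before alignment.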
Once the sequences (ORF1 or ORF2) from which the probes are to be generated are in a single file, the probe generator bash script is executed. First, the sequences provided as input are aligned with the multiple sequence alignment program chosen. The software currently provides support for Clustal Omega 1.2.0 [24], MUSCLE 3.8.31 [25], MAFFT 7.312 [26], and T-Coffee 11 [27]; the user may use whichever they prefer, as long as it is already installed on their system. This was one of the first improvements made to L1PD, since previously the sequences had been manually aligned with the aid of bioSyntax [28]. It should be noted that bioSyntax has continued to be an incredibly useful tool for viewing files on a remote server.
Once the sequences are aligned, the BioPython [29] module’s dumb_consensus function is used to extract k-mers from columns that meet the specified consensus threshold. Note that a high threshold is desirable for better results since this ensures that the probes are indeed good representatives of the L1s they aim to find. Previously, the consensus threshold (sometimes referred to as the “identity percentage” within L1PD files) had been hard coded, but now it is a parameter that may be specified by the user (95% is used by default). Originally it had been difficult to obtain a reasonable amount of probes with 95% identity when using larger values of k, but this was no longer an issue once the probe generation was updated to apply the identity percentage as a filtering criterion earlier in the process in order to reduce the number of pre-candidates.
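The idea of keeping only k-mers whose columns all meet the consensus threshold can be illustrated with a small, self-contained sketch. It does not call BioPython and is not the L1PD implementation; the function names are hypothetical, gaps are assumed to count against a column, and the reported offset is a column index in the alignment.

```python
from collections import Counter


def consensus_columns(aligned, threshold=0.95):
    """Return the consensus base per column, or None where no base
    reaches the threshold (a gap is never accepted as consensus)."""
    consensus = []
    for column in zip(*aligned):
        base, count = Counter(column).most_common(1)[0]
        ok = base != "-" and count / len(column) >= threshold
        consensus.append(base if ok else None)
    return consensus


def candidate_kmers(aligned, k, threshold=0.95):
    """Yield (offset, kmer) for every window of k consecutive columns
    that all meet the consensus threshold."""
    cons = consensus_columns(aligned, threshold)
    for start in range(len(cons) - k + 1):
        window = cons[start:start + k]
        if None not in window:
            yield start, "".join(window)


# Toy alignment of four sequences; columns 4 and 7 have 75% identity.
aligned = ["ACGTACGT", "ACGTACGA", "ACGTTCGT", "ACGTACGT"]
strict = list(candidate_kmers(aligned, k=4, threshold=0.95))
relaxed = list(candidate_kmers(aligned, k=4, threshold=0.75))
```

With the strict threshold only the first window survives, whereas the relaxed threshold accepts every window, which mirrors the trade-off between probe quality and probe count described above.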
The micro-read Fast Alignment Search Tool (mrFAST) [30] is then invoked to map these k-mers onto the reference genome. The results are analyzed to filter the probe candidates through the following refinement process:
  • Discard k-mers that did not get mapped to the ORF from which they were extracted.
  • From the set of remaining k-mers, discard those that did not map to every single L1.
  • From the set of remaining k-mers, sort in non-descending order of number of map hits on the genome, since a smaller number of hits will lower the number of false positives.
  • Following the order established in the previous step, select as probes the subset of all non-overlapping k-mers.
The resulting FASTA file will include the probes for the corresponding ORF that contained the original sequences. These probes are manually combined with the probes from the other ORF to produce the final probes file. Each comment line in the FASTA file will include some basic information, including the ORF from which it was extracted, the k-mer size, and its offset from the beginning of the ORF1 region. This offset is crucial for the success of L1PD, since it is used to determine whether a pattern of matches corresponds to an L1. Figure 2 provides a visualization of the probe-generation process.
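The four refinement steps listed above can be sketched as a filter, a sort, and a greedy selection. This is a minimal illustration under stated assumptions: the dictionary keys (`offset`, `maps_to_own_orf`, `l1_ids`, `genome_hits`) are hypothetical stand-ins for information that would be derived from the mrFAST output, and two k-mers are treated as overlapping when their offsets differ by less than k.

```python
def select_probes(candidates, all_l1_ids, k):
    """Filter and rank candidate k-mers, then keep a non-overlapping subset.

    candidates: list of dicts with hypothetical keys
      offset (int), maps_to_own_orf (bool), l1_ids (set), genome_hits (int).
    """
    # Step 1: discard k-mers that did not map back to their own ORF.
    kept = [c for c in candidates if c["maps_to_own_orf"]]
    # Step 2: discard k-mers that did not map to every L1.
    kept = [c for c in kept if c["l1_ids"] >= all_l1_ids]
    # Step 3: fewest genome hits first (stable sort keeps ties in input order).
    kept.sort(key=lambda c: c["genome_hits"])
    # Step 4: greedily keep k-mers that do not overlap an already chosen one.
    chosen = []
    for c in kept:
        if all(abs(c["offset"] - p["offset"]) >= k for p in chosen):
            chosen.append(c)
    return sorted(chosen, key=lambda c: c["offset"])


# Toy candidates, not real probe data.
everyone = {"a", "b"}
candidates = [
    {"offset": 0, "maps_to_own_orf": True, "l1_ids": {"a", "b"}, "genome_hits": 120},
    {"offset": 10, "maps_to_own_orf": True, "l1_ids": {"a", "b"}, "genome_hits": 90},
    {"offset": 60, "maps_to_own_orf": True, "l1_ids": {"a", "b"}, "genome_hits": 100},
    {"offset": 200, "maps_to_own_orf": True, "l1_ids": {"a"}, "genome_hits": 95},
    {"offset": 300, "maps_to_own_orf": False, "l1_ids": {"a", "b"}, "genome_hits": 80},
]
probes = select_probes(candidates, everyone, k=50)
probe_offsets = [p["offset"] for p in probes]
```

In this toy run the k-mer at offset 0 is dropped because it overlaps the better-ranked k-mer at offset 10.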

2.2. Incorporating Non-Human Genomes

L1Base2 [22] is an online database that contains information on L1s not only for the human reference genome but also for the reference genomes of several other species, such as Canis familiaris (dog) and Equus caballus (horse). Besides improving performance on human genomes, one of the main goals for this updated version of L1PD was to be able to generate probes and detect L1s for these other species as well.
The code structure for the probe-generation process allows probes for other species to be generated without any additional effort. Once the appropriate files (FASTA and CSV) are downloaded from L1Base2, the process works seamlessly to extract the L1 components and then generate the probes. However, it should be noted that the amount of probes, and the time required to generate them, may vary considerably.

2.3. L1PD Algorithm

The L1PD algorithm [17] receives a fully-assembled genome and uses mrFAST [30] to index the genome and then map the probes against that genome, generating a SAM file. It should be noted that most of the run time of L1PD is taken by these steps. The SAM file, along with the probes, is then fed into a Python script (L1PD.py), which applies the pattern-matching algorithm and generates an output file in GFF3 (General Feature Format Version 3) format. To summarize, the pattern-matching algorithm consists of finding patterns of “hits” of probes (in the SAM file) that are in the same order and within the expected distance as specified by the FASTA file that contains the probes. The full algorithm is explained by López et al. [17], but Figure 3 provides an overall visualization of L1PD. Note that the edit distance, distance threshold, and min. amount of k-mers in a pattern are all parameters that may be specified by the user, although L1PD provides default values for certain species.
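To make the notion of hits that are "in the same order and within the expected distance" concrete, the following is a deliberately simplified greedy sketch. It is not the published algorithm (see [17] for that): it assumes hits on a single chromosome and strand, represents each hit as a hypothetical `(genome_pos, probe_offset)` pair, and leaves the edit distance to the mapper.

```python
def find_patterns(hits, threshold, min_probes):
    """Greedily chain probe hits into candidate L1 patterns.

    hits: list of (genome_pos, probe_offset) on one chromosome and strand.
    A hit extends the current chain when its probe offset is larger than
    the previous one and the observed genomic spacing deviates from the
    expected spacing by at most `threshold`. Chains with at least
    `min_probes` hits are returned.
    """
    patterns, chain = [], []
    for pos, offset in sorted(hits):
        if chain:
            prev_pos, prev_offset = chain[-1]
            expected = offset - prev_offset
            observed = pos - prev_pos
            if expected > 0 and abs(observed - expected) <= threshold:
                chain.append((pos, offset))
                continue
            # The chain is broken: keep it only if it is long enough.
            if len(chain) >= min_probes:
                patterns.append(chain)
        chain = [(pos, offset)]
    if len(chain) >= min_probes:
        patterns.append(chain)
    return patterns


# Toy hits: one complete pattern and one that is too short.
hits = [(1000, 0), (1100, 100), (1310, 300),
        (5000, 0), (5100, 100), (9000, 300)]
patterns = find_patterns(hits, threshold=20, min_probes=3)
```

Raising `threshold` or lowering `min_probes` in this sketch accepts more chains, which corresponds to trading precision for recall.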
GFF3 files store genome information features in nine tab-delimited columns. Of the nine columns, L1PD fills the following seven:
  • sequence id (chromosome where LINE-1 was found);
  • source (“L1PD”);
  • type (“mobile_genetic_element”);
  • start (start position of the LINE-1);
  • end (end position of the LINE-1);
  • strand (“+” for forward strand and “−” for reverse strand);
  • attributes (“Name = LINE1”).
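As a small illustration of the record layout, the sketch below builds one tab-delimited GFF3 line with the seven fields listed above. The two columns that are not filled (score and phase) are assumed to hold the GFF3 placeholder ".", and the attribute is written without spaces, following the tag=value convention of GFF3; the function name is hypothetical.

```python
def gff3_line(seqid, start, end, strand):
    """Format one detected L1 as a nine-column GFF3 record."""
    columns = [
        seqid,                     # sequence id (chromosome)
        "L1PD",                    # source
        "mobile_genetic_element",  # type
        str(start),                # start position
        str(end),                  # end position
        ".",                       # score (not filled)
        strand,                    # "+" or "-"
        ".",                       # phase (not filled)
        "Name=LINE1",              # attributes
    ]
    return "\t".join(columns)


line = gff3_line("chr1", 1000, 7000, "+")
fields = line.split("\t")
```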
One of the main obstacles in extending L1PD for use with other species was the fact that not only do different species have different amounts of L1s, but the average L1 length also varies, as well as the average lengths of the main components (ORFs and UTRs). These lengths are important because they are used to calculate the start and end positions that will be written to the GFF3 file. Since the lengths are expected to be normally distributed, the updated version of L1PD now uses the CSV file to calculate the modes (most common value) of the length of each component, which are then used to calculate the start and end positions of the L1.
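One way such a calculation might look is sketched below. The column names (`utr5_len`, `orf1_len`, `spacer_len`, `orf2_len`, `utr3_len`), the forward-strand assumption, and the toy values are all hypothetical; the sketch only illustrates taking the mode of each component length and deriving start and end positions from the first probe hit and its offset from the beginning of ORF1.

```python
import csv
import io
from statistics import mode

COLUMNS = ["utr5_len", "orf1_len", "spacer_len", "orf2_len", "utr3_len"]


def component_modes(csv_handle, columns=COLUMNS):
    """Most common value of each length column, skipping empty entries."""
    values = {name: [] for name in columns}
    for row in csv.DictReader(csv_handle):
        for name in columns:
            cell = (row.get(name) or "").strip()
            if cell:
                values[name].append(int(cell))
    return {name: mode(vals) for name, vals in values.items()}


def l1_bounds(first_hit_pos, first_hit_offset, modes):
    """Estimate L1 start and end (forward strand) from the first probe hit.

    first_hit_offset is the probe's offset from the beginning of ORF1.
    """
    orf1_start = first_hit_pos - first_hit_offset
    start = orf1_start - modes["utr5_len"]
    total_length = sum(modes.values())
    return start, start + total_length - 1


# Toy metadata (made-up values); the second row has an empty entry.
csv_text = (
    "utr5_len,orf1_len,spacer_len,orf2_len,utr3_len\n"
    "900,1017,63,3828,200\n"
    "910,1017,63,3828,\n"
    "900,1017,63,3825,200\n"
)
modes = component_modes(io.StringIO(csv_text))
bounds = l1_bounds(first_hit_pos=10000, first_hit_offset=100, modes=modes)
```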
Since the previous version of L1PD focused only on human genomes, it also generated a histogram comparing the number of L1s in each chromosome of the GRCh38 vs. the provided target genome. However, this feature has been removed since now the user will be able to run L1PD with different species.

2.4. Precision and Recall

The precision and recall shell script was initially used as a private component but is now also being provided in the repository for public use. This component was created to confirm the validity of L1PD and also to test its sensitivity to changes in edit distance, distance threshold, and min. amount of probes required in a pattern. The aim was to apply L1PD to the reference genome with varying combinations of these parameters, calculating the precision, recall, and F1 score each time, and then use the values that maximized F1 score as default values for L1PD. It should be noted that the F1 score is the harmonic mean of precision and recall [31]. Additionally, tables with detailed results were provided [17] to help guide the user as to what parameters to change in case they wished to favor either precision or recall.
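For reference, the three metrics can be computed as in the short sketch below, where TP, FP, and FN denote true positives, false positives, and false negatives. The function names are illustrative; the final check reuses the precision and recall reported in the first row of Table A1.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# First row of Table A1: precision 0.79059, recall 0.50537, F1 0.61659.
f1_first_row = f1_score(0.79059, 0.50537)
```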
Another of the improvements to L1PD was changing the primary parameters (edit distance, distance threshold, and min. amount of probes) from being fixed to allowing the user to specify them on the command line, which is necessary since these values are expected to vary considerably with genomes of different species. This gives the user the flexibility to run hundreds of experiments consecutively, exploring the results for different combinations of these parameters. These experiments are essential to understanding how well the probes are performing for that particular species.
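A parameter sweep of this kind can be sketched as a simple grid search. Here `evaluate` is a hypothetical callback standing in for one L1PD run followed by the precision and recall calculation; in this toy example it merely looks up six precision and recall pairs taken from Table A1 (50-mers, minimum of 24 probes).

```python
from itertools import product


def grid_search(evaluate, edit_distances, thresholds, min_probes_values):
    """Return (best_f1, edit_distance, threshold, min_probes)."""
    best = None
    for ed, th, mp in product(edit_distances, thresholds, min_probes_values):
        precision, recall = evaluate(ed, th, mp)
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        if best is None or f1 > best[0]:
            best = (f1, ed, th, mp)
    return best


# (edit distance, threshold, min probes) -> (precision, recall), from Table A1.
table_a1_subset = {
    (5, 625, 24): (0.79059, 0.50537),
    (5, 650, 24): (0.79066, 0.50559),
    (10, 625, 24): (0.76743, 0.59256),
    (10, 650, 24): (0.76713, 0.59278),
    (15, 625, 24): (0.76294, 0.59820),
    (15, 650, 24): (0.76265, 0.59842),
}


def lookup(ed, th, mp):
    """Stand-in for an actual L1PD run: return stored precision and recall."""
    return table_a1_subset[(ed, th, mp)]


best = grid_search(lookup, [5, 10, 15], [625, 650], [24])
```

For this subset the search selects edit distance 15 with threshold 650, in agreement with the best 50-mer row of Table A1.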
Since L1Base2 [22] contains all of the relevant L1 information, it was used to calculate the total amount of L1s in humans, thereby allowing us to calculate the amount of true positives and false negatives in order to properly calculate the values for precision, recall, and F1 score. However, these values were fixed, so L1PD was updated so that the user may provide the directory where the corresponding L1Base2 CSV files are stored; that way, these calculations may still be performed for different species. Through testing, it was found that there were a few entries in the CSV files for the human genome that were empty. The code now gracefully disregards these special cases to allow the process to continue. Note that the CSV files must follow a particular naming convention; sample files will be provided in the repository.
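The counting step might be sketched as follows. The matching criterion used here (any overlap on the same chromosome, with each reference L1 matched at most once) and the field names `chrom`, `start`, and `end` are assumptions made for illustration; the criterion used by the actual precision and recall script may differ.

```python
def parse_reference(rows):
    """Keep (chrom, start, end) for rows in which all three fields are present."""
    reference = []
    for row in rows:
        chrom, start, end = row.get("chrom"), row.get("start"), row.get("end")
        if not (chrom and start and end):
            continue  # skip empty or incomplete entries
        reference.append((chrom, int(start), int(end)))
    return reference


def count_matches(predicted, reference):
    """Return (tp, fp, fn) using any-overlap matching (an assumption)."""
    matched = set()
    tp = 0
    for chrom, start, end in predicted:
        hit = next(
            (i for i, (c, s, e) in enumerate(reference)
             if i not in matched and c == chrom and start <= e and s <= end),
            None,
        )
        if hit is not None:
            matched.add(hit)
            tp += 1
    fp = len(predicted) - tp
    fn = len(reference) - len(matched)
    return tp, fp, fn


# Toy data: the third reference row is empty and is ignored.
rows = [
    {"chrom": "chr1", "start": "100", "end": "6100"},
    {"chrom": "chr1", "start": "20000", "end": "26000"},
    {"chrom": "", "start": "", "end": ""},
    {"chrom": "chr2", "start": "500", "end": "6500"},
]
reference = parse_reference(rows)
predicted = [("chr1", 150, 6150), ("chr2", 9000, 15000)]
tp, fp, fn = count_matches(predicted, reference)
```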

3. Results

3.1. Results with Different k-mer Sizes

The algorithms were executed on the human genome using three different k-mer sizes (50, 75, 100) in order to assess the best outcome in F1 scores. For these experiments, the identity percentage stayed at 95%, as it displayed promising results, although this parameter can be adjusted when executing the algorithm. To test the effectiveness of the probes with the updated probe-generation algorithm, L1PD was executed using varying ranges of edit distance, distance threshold, and min. amount of k-mers in a pattern. For the 50-mer probes, the best F1 score was found with an edit distance of 15, whereas the 75-mer and 100-mer probes both peaked at an edit distance of 30. Table 1 summarizes these results; detailed results are available in Appendix A.1.
As can be seen, the 100-mers obtained the highest F1 scores. Both 75-mers and 100-mers showed relatively similar values once the edit distance reached at least 15, but 100-mers have the fewest probes of the three different k-mer sizes and require fewer k-mers per pattern, which was expected due to the increase in k-mer size. The overall highest case of F1 score was achieved with the 100-mers using edit distance 30, threshold 600, and a minimum of nine probes. For all three k-mer sizes, the results peaked and then started steadily decreasing. Figure 4 provides a visual comparison of the best F1 score for each edit distance and k-mer size.
Additionally, using longer probes did not appreciably change the run time, as shown in Table 2.

3.2. L1PD with Other L1Base2 Genomes

Given that L1PD is based on L1Base2 [22], its applicability extends to various genomes. We chose to apply it to species from different orders: Carnivora, represented by the dog (Canis familiaris); Artiodactyla, represented by the cow (Bos taurus); and Perissodactyla, represented by the horse (Equus caballus). The reference genomes for these species were obtained from the Ensembl Release 84 FTP site (http://ftp.ensembl.org/pub/release-84/fasta/ (accessed on 31 December 2023)), ensuring that the genomes used are the same versions as those employed by L1Base2. Additionally, other files, such as the L1 sequences and metadata files, were obtained from L1Base2 in the same manner as with the human genome.
Using the same process as for the human genome, probes were generated for each species, with three different k-mer sizes (50, 75, and 100) and using a consensus threshold of 95%. Once the probes had been generated, L1PD was then executed with no changes to the code, but this time the arguments were the corresponding files for the species being targeted. The precision and recall experiments were then carried out to determine the values of k-mer size, edit distance, distance threshold, and minimum amount of probes per pattern that maximized the F1 score. The summarized results for dog and horse are presented in Table 3 and Table 4, respectively. Tables with more detailed results can be found in the appendix (Appendix A.2 and Appendix A.3). Additionally, Figure 5 and Figure 6 provide a visual comparison of the best F1 score for each edit distance and k-mer size for the dog and horse genomes, respectively.
Although good results were obtained with the cow genome using 50-mers and an identity percentage of 95%, probes for ORF1 could not be generated with a k-mer size of 75. In the other species (human, dog, and horse), the total number of probes tends to decrease as the k-mer size increases. Since there were only five 50-mer probes for the cow, the identity percentage might need to be reduced in order to obtain 75-mer probes. Due to this setback and pending further analysis, L1PD was executed exclusively with probes of size 50 for the cow. Nonetheless, the results for the cow were the most promising of the non-human species, yielding the highest F1 score among them when using 50-mers (0.66801). Summarized data for the cow are presented in Table 5; more detailed results can be found in the appendix (Appendix A.4).
As with the human genome, the time required to execute L1PD was not affected significantly by using larger probes, as shown in Table 6.

4. Discussion

The work by Phan et al. [23] established that probe k-mer size and recall are positively correlated, meaning that increasing the k-mer size should result in an increase in recall. The initial 50-mer experiments were thus extended to include 75-mers and 100-mers, following the hypothesis that increasing the probe length would directly improve the pattern-matching recall. The results support this hypothesis, although the difference was not as significant as expected. Once adequate values of edit distance, distance threshold, and minimum probes were reached, the recall gradually increased for larger k-mer sizes. In addition, the precision increased as well (when compared to 50-mers), resulting in an overall higher F1 score. The overall best results with humans were found with 100-mers, using an edit distance of 30, a distance threshold of 600, and a minimum of 9 probes per pattern, for an F1 score of 0.72931.
Increasing edit distance provides flexibility in finding probes that may have had more changes due to rearrangements or mutations, which is why the lowest edit distance always provided the worst results. Similarly, a higher distance threshold allows for probe hits that may be farther apart (or closer together) than originally expected, due to insertions (or deletions) that may have occurred. As anticipated, increasing recall can have the undesirable side effect of decreasing precision, and this can be seen in certain cases when comparing results from 100-mers with results of 75-mers.
It also stands to reason that increasing k-mer size would provide better results. DNA contains highly repetitive sequences, so a larger k-mer size reduces how often these common sequences are matched, yielding more accurate results.

4.1. Advantages and Limitations

4.1.1. Advantages

One of the main reasons L1PD was originally developed was that the seed-and-extend strategy seemed ill-fitted for the task of L1 detection [17]. L1PD removes the heuristics associated with the extend phase and replaces them with pattern matching, making it a novel and promising approach for L1 detection.
The probe-generation scripts can be executed with other genomes annotated in L1Base2, allowing researchers to generate probes for other genomes. These probes, along with the corresponding CSV files, can then be used by L1PD to detect L1s.
Because of the narrow focus of L1PD, users are able to execute it in “Genome mode” by providing very little input. By default, L1PD uses a previously generated set of probes and the values of edit distance, distance threshold, and minimum amount of k-mers that resulted in optimal F1 scores (see Appendix A for the experimental results). Hence, the user only needs to provide the genome (in FASTA format) in which L1s are to be detected and to specify the species of that genome. L1PD also allows for the user to specify custom values of any of these parameters, in case they wish to favor either precision or recall.

4.1.2. Duplicate Matches

Through experimentation, it was found that excessively high values of edit distance and/or distance threshold generated “duplicate” matches, meaning that certain patterns were matched to more than one L1. However, this did not affect the results since the highest F1 score was obtained with lower edit distance and threshold values that did not exhibit this behavior. In the future, additional experiments could be carried out to get a better understanding of the values of the parameters that cause these duplicates to start creeping in. This might help to limit upper boundaries for edit distance and/or threshold since currently the upper bounds are determined experimentally. For example, although for 50-mers we carried the experiments through a maximum edit distance of at least 25, this might be too high since it could mean that half of the k-mer has been changed.
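The proportion of a probe that a given edit distance allows to change is simply the edit distance divided by k. The short sketch below works this out for the example above (edit distance 25 with 50-mers) and for the edit distances that gave the best F1 score for each k-mer size in Section 3.1; the function name is illustrative.

```python
def editable_fraction(edit_distance, k):
    """Fraction of a k-mer that may differ at the given edit distance."""
    return edit_distance / k


# (k-mer size, edit distance) pairs mentioned in the text.
cases = [(50, 25), (50, 15), (75, 30), (100, 30)]
fractions = [(k, ed, editable_fraction(ed, k)) for k, ed in cases]
# 25/50 = 0.5, 15/50 = 0.3, 30/75 = 0.4, 30/100 = 0.3
```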

4.1.3. CSV File Required

The original version of L1PD was designed specifically for humans, so there was certain information that was hard coded into the software, such as the average ORF1 and ORF2 lengths. This information can no longer be hard coded since L1PD can now be easily used with different genomes available from L1Base2. Due to this, L1PD now requires an additional argument with the path to a CSV file containing the metadata for the reference genome of that species. These files must follow the format of the CSV files available from L1Base2, and L1PD will use that information to determine the mean length of the different L1 components.
The long-term goal is for researchers to be able to use L1PD with other species they are interested in. However, this will require the CSV file with the metadata to be created, which is a limitation in applying L1PD to different species. One possible workaround is to avoid GFF3 format and/or use an annotation format that does not require the start and end position of every L1.

4.1.4. Time Required for Generating Probes

When we began applying L1PD to other species from L1Base2, the intention was to execute it with at least one representative from each order available on the same webpage. For rodents, the mouse (Mus musculus) was chosen as the representative. However, when attempting to execute the steps for generating ORF2 probes with a k-mer size of 50, the process had already been running for more than 20 h, significantly longer than all the previous species. We believe this extended execution time may be due to the substantial number of full-length intact L1s in the mouse, totaling 2811; much higher than the second-highest species we worked with, the dog, which only had 264 full-length intact L1s. Upon surpassing the 20-h mark, we decided to stop the execution of the script, setting a future goal of optimizing the code used in probe generation. Besides analyzing the code structure, one of the ideas for optimization we implemented is to allow the user to specify the number of threads used by the alignment program, provided that the program supports such an option. However, mrFAST [30] currently does not support multithreading.

4.1.5. Finding Appropriate Values of k

Finally, as seen with the cow genome, longer k-mers such as 75-mers or 100-mers are not always feasible. On the other hand, it is possible that certain genomes could use even longer k-mers, so some experimentation is required for the user to determine what k-mer size to use. Work has been started on polishing the probe-generator component so that it may be used to find optimal values of k, thereby saving time and promoting the use of adequately sized k-mers.

4.2. Applications

As with the original version [17], the updated version of L1PD can be used to calculate Copy Number Variation by analyzing the change in L1 count with respect to the reference genome.
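A minimal sketch of such a comparison is shown below: it counts the L1 records per chromosome in two GFF3 outputs and reports the difference. The function names and the toy records are hypothetical, and the sketch only compares counts; it does not attempt to match individual L1s between genomes.

```python
from collections import Counter


def l1_counts(gff3_text):
    """Count GFF3 records per sequence id, ignoring comments and blank lines."""
    counts = Counter()
    for line in gff3_text.splitlines():
        if not line or line.startswith("#"):
            continue
        counts[line.split("\t")[0]] += 1
    return counts


def count_differences(target_counts, reference_counts):
    """Per-chromosome difference in L1 count (target minus reference)."""
    chroms = sorted(set(target_counts) | set(reference_counts))
    return {c: target_counts[c] - reference_counts[c] for c in chroms}


def record(seqid, start, end):
    """Build a toy GFF3 line for the example."""
    return "\t".join([seqid, "L1PD", "mobile_genetic_element",
                      str(start), str(end), ".", "+", ".", "Name=LINE1"])


reference_gff3 = "\n".join(["##gff-version 3",
                            record("chr1", 100, 6100),
                            record("chr2", 500, 6500)])
target_gff3 = "\n".join(["##gff-version 3",
                         record("chr1", 100, 6100),
                         record("chr1", 90000, 96000),
                         record("chr3", 700, 6700)])
differences = count_differences(l1_counts(target_gff3), l1_counts(reference_gff3))
```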
Since L1PD can now be applied to several genomes, the histogram feature has been temporarily disabled, but perhaps it can be enabled in the future to get a visual comparison of the L1 distribution in the user-provided genome in comparison to the reference genome. Nonetheless, L1PD can still be executed in three different modes, allowing its use even when only reads in FASTQ format are available. See Figure 7 and López et al. [17] for more details.
The pattern-matching strategy lends itself to be applied to different families of sequences that have column ranges of high similarity. This is the reason why the methodology was thoroughly explained and the code is made freely available. We plan on expanding in this area in the near future and possibly collaborating with colleagues who work with different species and/or families of sequences.

4.3. Future Work

Future work includes optimizing the code; finding the largest k-mer size for humans that yields the best results in terms of maximizing recall; applying the pattern-matching algorithm to other families of sequences, such as SINEs (Short Interspersed Nuclear Elements); making L1PD easier to use with genomes not available from L1Base2; and polishing the probe-generator component so that it can be used as a standalone tool.

5. Conclusions

L1PD proves to be an efficient and promising approach for L1 detection through its seed-and-pattern-match approach. Increasing k-mer size from 50 to 100 improved precision and recall, and L1PD works adequately, although with lower-than-expected results, with genomes from other species available from L1Base2. By improving L1PD’s performance and usefulness, as well as that of the probe-generation algorithm, we hope to help propel further L1 research.

Author Contributions

Conceptualization, J.O.L.; data curation, J.L.Q. and E.D.M.; funding acquisition, J.O.L.; investigation, J.O.L., J.L.Q. and E.D.M.; methodology, J.O.L.; project administration, J.O.L.; software, J.L.Q. and E.D.M.; supervision, J.O.L.; validation, J.L.Q. and E.D.M.; visualization, J.L.Q. and E.D.M.; writing—original draft preparation, J.O.L., J.L.Q. and E.D.M.; writing—review and editing, J.O.L. All authors have read and agreed to the published version of the manuscript.

Funding

The APC was funded by the University of Puerto Rico at Arecibo.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The source code for L1PD, as well as for the probe generation and precision and recall processes, is available at https://github.com/juan-lopez/L1PD (accessed on 31 December 2023). The code consists of several shell scripts, Python scripts, FASTA files with the probes, as well as sample output files. The shell scripts should run under most Unix-like systems. L1PD may be executed in one of three modes: 1. Genome mode; 2. BAM/CRAM mode; 3. FASTQ mode. BAM/CRAM mode automatically invokes Genome mode, and FASTQ mode automatically invokes BAM/CRAM mode, as shown in Figure 7.

Acknowledgments

The High Performance Computing Facility of the University of Puerto Rico is where this improved version of L1PD was written and where all of the corresponding experiments were carried out. Thanks to the University of Puerto Rico and the Institutional Development Award (IDeA) INBRE grant P20 GM103475 from the National Institute for General Medical Sciences (NIGMS), a component of the National Institutes of Health (NIH) and the Bioinformatics Research Core of INBRE.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of this study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
GFF3: General Feature Format Version 3
L1, LINE-1: Long Interspersed Element-1
L1PD: LINE-1 Pattern Detection
ORF: Open Reading Frame
SINE: Short Interspersed Element
SVA: SINE, VNTR, and Alu
UTR: Untranslated Region

Appendix A. Detailed Experimental Results

Appendix A.1. Detailed L1PD Results for Homo sapiens (human)

Table A1. F1 score: Homo sapiens (human), k-mer size: 50, total num. of probes: 43.
Edit Distance | Threshold | Minimum Probes | Precision | Recall | F1 Score
5 | 625 | 24 | 0.79059 | 0.50537 | 0.61659
5 | 650 | 24 | 0.79066 | 0.50559 | 0.61677
5 | 675 | 24 | 0.79044 | 0.50574 | 0.61682
5 | 700 | 24 | 0.78994 | 0.50588 | 0.61676
5 | 725 | 24 | 0.78940 | 0.50588 | 0.61660
10 | 600 | 24 | 0.76762 | 0.59176 | 0.66831
10 | 625 | 24 | 0.76743 | 0.59256 | 0.66874
10 | 650 | 24 | 0.76713 | 0.59278 | 0.66877
10 | 675 | 24 | 0.76647 | 0.59300 | 0.66866
10 | 700 | 24 | 0.76603 | 0.59322 | 0.66863
15 | 600 | 24 | 0.76312 | 0.59739 | 0.67016
15 | 625 | 24 | 0.76294 | 0.59820 | 0.67059
15 | 650 | 24 | 0.76265 | 0.59842 | 0.67062
15 | 675 | 24 | 0.76203 | 0.59871 | 0.67056
15 | 700 | 24 | 0.76160 | 0.59893 | 0.67054
20 | 625 | 25 | 0.77821 | 0.58803 | 0.66988
20 | 650 | 25 | 0.77787 | 0.58839 | 0.66998
20 | 675 | 25 | 0.77723 | 0.58876 | 0.66999
20 | 700 | 25 | 0.77659 | 0.58891 | 0.66984
20 | 725 | 25 | 0.77602 | 0.58898 | 0.66968
25 | 625 | 25 | 0.77735 | 0.58869 | 0.66998
25 | 650 | 25 | 0.77701 | 0.58905 | 0.67009
25 | 675 | 25 | 0.77637 | 0.58942 | 0.67009
25 | 700 | 25 | 0.77574 | 0.58956 | 0.66995
25 | 725 | 25 | 0.77514 | 0.58956 | 0.66972
30 | 625 | 25 | 0.77780 | 0.58869 | 0.67015
30 | 650 | 25 | 0.77746 | 0.58905 | 0.67026
30 | 675 | 25 | 0.77682 | 0.58942 | 0.67026
30 | 700 | 25 | 0.77619 | 0.58956 | 0.67012
30 | 725 | 25 | 0.77559 | 0.58956 | 0.66989
Values in bold indicate the parameters that resulted in the best F1 Score for each edit distance. The row highlighted in cyan represents the highest F1 Score obtained across the entire species.
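The F1 Score column can be recomputed from the Precision and Recall columns, assuming the standard definition of F1 as the harmonic mean of the two. The sketch below does this for three rows transcribed from Table A1; the function name `f1_score` and the row layout are illustrative choices, not part of L1PD.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (edit distance, threshold, minimum probes, precision, recall, tabulated F1)
# transcribed from Table A1 (human, k-mer size 50).
rows = [
    (5, 675, 24, 0.79044, 0.50574, 0.61682),
    (10, 650, 24, 0.76713, 0.59278, 0.66877),
    (15, 650, 24, 0.76265, 0.59842, 0.67062),
]

for ed, threshold, min_probes, p, r, tabulated in rows:
    computed = f1_score(p, r)
    # Precision and recall are rounded in the table, so compare with a tolerance.
    agrees = abs(computed - tabulated) < 1e-4
    print(f"ED={ed} threshold={threshold}: computed={computed:.5f} "
          f"tabulated={tabulated:.5f} agrees={agrees}")
```

Because the tabulated precision and recall are already rounded to five decimals, the recomputed value is compared with a small tolerance rather than for exact equality.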
Table A2. F1 score: Homo sapiens (human), k-mer size: 75, total num. of probes: 22.
Edit Distance | Threshold | Minimum Probes | Precision | Recall | F1 Score
5 | 650 | 12 | 0.80400 | 0.44020 | 0.56891
5 | 675 | 12 | 0.80357 | 0.44020 | 0.56880
5 | 700 | 12 | 0.80344 | 0.44042 | 0.56895
5 | 725 | 12 | 0.80301 | 0.44042 | 0.56884
5 | 750 | 12 | 0.80253 | 0.44056 | 0.56884
10 | 575 | 14 | 0.79404 | 0.59081 | 0.67751
10 | 600 | 14 | 0.79407 | 0.59176 | 0.67814
10 | 625 | 14 | 0.79380 | 0.59249 | 0.67852
10 | 650 | 14 | 0.79299 | 0.59264 | 0.67832
10 | 675 | 14 | 0.79252 | 0.59264 | 0.67815
15 | 575 | 17 | 0.89554 | 0.57632 | 0.70130
15 | 600 | 17 | 0.89448 | 0.57728 | 0.70169
15 | 625 | 17 | 0.89338 | 0.57801 | 0.70189
15 | 650 | 17 | 0.89241 | 0.57823 | 0.70175
15 | 675 | 17 | 0.89152 | 0.57830 | 0.70153
20 | 575 | 17 | 0.88858 | 0.59212 | 0.71067
20 | 600 | 17 | 0.88774 | 0.59293 | 0.71098
20 | 625 | 17 | 0.88651 | 0.59366 | 0.71111
20 | 650 | 17 | 0.88529 | 0.59388 | 0.71087
20 | 675 | 17 | 0.88432 | 0.59388 | 0.71056
25 | 575 | 17 | 0.88327 | 0.59724 | 0.71261
25 | 600 | 17 | 0.88216 | 0.59798 | 0.71278
25 | 625 | 17 | 0.88086 | 0.59871 | 0.71287
25 | 650 | 17 | 0.87938 | 0.59893 | 0.71255
25 | 675 | 17 | 0.87825 | 0.59893 | 0.71218
30 | 575 | 17 | 0.87975 | 0.59937 | 0.71298
30 | 600 | 17 | 0.87865 | 0.60010 | 0.71313
30 | 625 | 17 | 0.87747 | 0.60083 | 0.71326
30 | 650 | 17 | 0.87621 | 0.60112 | 0.71304
30 | 675 | 17 | 0.87509 | 0.60112 | 0.71267
35 | 575 | 17 | 0.87492 | 0.60068 | 0.71231
35 | 600 | 17 | 0.87375 | 0.60141 | 0.71243
35 | 625 | 17 | 0.87258 | 0.60215 | 0.71256
35 | 650 | 17 | 0.87134 | 0.60244 | 0.71235
35 | 675 | 17 | 0.87016 | 0.60251 | 0.71201
40 | 575 | 17 | 0.87473 | 0.60068 | 0.71224
40 | 600 | 17 | 0.87365 | 0.60141 | 0.71240
40 | 625 | 17 | 0.87231 | 0.60215 | 0.71247
40 | 650 | 17 | 0.87116 | 0.60244 | 0.71229
40 | 675 | 17 | 0.86998 | 0.60251 | 0.71195
Values in bold indicate the parameters that resulted in the best F1 Score for each edit distance. The row highlighted in cyan represents the highest F1 Score obtained across the entire species.
Table A3. F1 Score: Homo sapiens (human), k-mer size: 100, total num. of probes: 13.
Edit Distance | Threshold | Minimum Probes | Precision | Recall | F1 Score
5 | 525 | 9 | 0.95444 | 0.27123 | 0.42241
5 | 550 | 9 | 0.95446 | 0.27137 | 0.42258
5 | 575 | 9 | 0.95448 | 0.27152 | 0.42277
5 | 600 | 9 | 0.95477 | 0.27181 | 0.42315
5 | 625 | 9 | 0.95503 | 0.27188 | 0.42325
10 | 575 | 7 | 0.76933 | 0.54209 | 0.63602
10 | 600 | 7 | 0.76967 | 0.54290 | 0.63668
10 | 625 | 7 | 0.76994 | 0.54348 | 0.63718
10 | 650 | 7 | 0.76951 | 0.54363 | 0.63713
10 | 675 | 7 | 0.76916 | 0.54377 | 0.63711
15 | 575 | 9 | 0.89232 | 0.58437 | 0.70623
15 | 600 | 9 | 0.89161 | 0.58547 | 0.70681
15 | 625 | 9 | 0.89072 | 0.58613 | 0.70701
15 | 650 | 9 | 0.88948 | 0.58635 | 0.70678
15 | 675 | 9 | 0.88853 | 0.58657 | 0.70664
20 | 575 | 9 | 0.88355 | 0.60997 | 0.72169
20 | 600 | 9 | 0.88243 | 0.61107 | 0.72209
20 | 625 | 9 | 0.88124 | 0.61173 | 0.72215
20 | 650 | 9 | 0.87981 | 0.61202 | 0.72187
20 | 675 | 9 | 0.87871 | 0.61209 | 0.72155
25 | 550 | 9 | 0.87679 | 0.61999 | 0.72635
25 | 575 | 9 | 0.87692 | 0.62124 | 0.72725
25 | 600 | 9 | 0.87538 | 0.62226 | 0.72742
25 | 625 | 9 | 0.87406 | 0.62292 | 0.72741
25 | 650 | 9 | 0.87260 | 0.62329 | 0.72716
30 | 550 | 9 | 0.86970 | 0.62643 | 0.72828
30 | 575 | 9 | 0.86993 | 0.62767 | 0.72920
30 | 600 | 9 | 0.86826 | 0.62870 | 0.72931
30 | 625 | 9 | 0.86681 | 0.62936 | 0.72924
30 | 650 | 9 | 0.86531 | 0.62972 | 0.72894
35 | 525 | 9 | 0.86448 | 0.62855 | 0.72786
35 | 550 | 9 | 0.86405 | 0.62950 | 0.72835
35 | 575 | 9 | 0.86428 | 0.63075 | 0.72926
35 | 600 | 9 | 0.86239 | 0.63170 | 0.72923
35 | 625 | 9 | 0.86105 | 0.63236 | 0.72919
40 | 525 | 9 | 0.85919 | 0.63067 | 0.72740
40 | 550 | 9 | 0.85886 | 0.63162 | 0.72791
40 | 575 | 9 | 0.85892 | 0.63287 | 0.72876
40 | 600 | 9 | 0.85715 | 0.63382 | 0.72875
40 | 625 | 9 | 0.85567 | 0.63448 | 0.72865
Values in bold indicate the parameters that resulted in the best F1 Score for each edit distance. The row highlighted in cyan represents the highest F1 Score obtained across the entire species.

Appendix A.2. Detailed L1PD Results for Canis familiaris (dog)

Table A4. F1 score: Canis familiaris (dog), k-mer size: 50, total num. of probes: 24.
Edit Distance | Threshold | Minimum Probes | Precision | Recall | F1 Score
5 | 700 | 5 | 0.58494 | 0.45187 | 0.50986
5 | 725 | 5 | 0.58505 | 0.45207 | 0.51002
5 | 750 | 5 | 0.58541 | 0.45217 | 0.51022
5 | 775 | 5 | 0.58467 | 0.45207 | 0.50988
5 | 800 | 5 | 0.58465 | 0.45217 | 0.50994
10 | 675 | 10 | 0.65209 | 0.45157 | 0.53360
10 | 700 | 10 | 0.65196 | 0.45187 | 0.53377
10 | 725 | 10 | 0.65220 | 0.45217 | 0.53406
10 | 750 | 10 | 0.65197 | 0.45227 | 0.53405
10 | 775 | 10 | 0.65140 | 0.45207 | 0.53372
15 | 700 | 10 | 0.64176 | 0.45568 | 0.53294
15 | 725 | 10 | 0.64197 | 0.45608 | 0.53328
15 | 750 | 10 | 0.64183 | 0.45618 | 0.53331
15 | 775 | 10 | 0.64128 | 0.45598 | 0.53298
15 | 800 | 10 | 0.64106 | 0.45608 | 0.53296
20 | 700 | 10 | 0.63358 | 0.45698 | 0.53097
20 | 725 | 10 | 0.63352 | 0.45738 | 0.53122
20 | 750 | 10 | 0.63331 | 0.45748 | 0.53122
20 | 775 | 10 | 0.63268 | 0.45728 | 0.53086
20 | 800 | 10 | 0.63246 | 0.45738 | 0.53084
25 | 675 | 11 | 0.65064 | 0.44776 | 0.53046
25 | 700 | 11 | 0.65045 | 0.44796 | 0.53053
25 | 725 | 11 | 0.65070 | 0.44826 | 0.53082
25 | 750 | 11 | 0.65037 | 0.44836 | 0.53078
25 | 775 | 11 | 0.64989 | 0.44816 | 0.53048
Values in bold indicate the parameters that resulted in the best F1 Score for each edit distance. The row highlighted in cyan represents the highest F1 Score obtained across the entire species.
Table A5. F1 Score: Canis familiaris (dog), k-mer size: 75, total num. of probes: 14.
Edit Distance | Threshold | Minimum Probes | Precision | Recall | F1 Score
5 | 850 | 2 | 0.57956 | 0.39582 | 0.47038
5 | 875 | 2 | 0.57948 | 0.39582 | 0.47034
5 | 900 | 2 | 0.57956 | 0.39582 | 0.47038
5 | 925 | 2 | 0.57948 | 0.39582 | 0.47034
5 | 950 | 2 | 0.57931 | 0.39582 | 0.47029
10 | 700 | 6 | 0.65159 | 0.44646 | 0.52985
10 | 725 | 6 | 0.65165 | 0.44676 | 0.53009
10 | 750 | 6 | 0.65151 | 0.44686 | 0.53011
10 | 775 | 6 | 0.65146 | 0.44676 | 0.53003
10 | 800 | 6 | 0.65122 | 0.44686 | 0.53001
15 | 675 | 9 | 0.71642 | 0.44505 | 0.54902
15 | 700 | 9 | 0.71633 | 0.44535 | 0.54923
15 | 725 | 9 | 0.71589 | 0.44565 | 0.54933
15 | 750 | 9 | 0.71548 | 0.44575 | 0.54928
15 | 775 | 9 | 0.71497 | 0.44565 | 0.54905
20 | 675 | 9 | 0.70608 | 0.45257 | 0.55159
20 | 700 | 9 | 0.70600 | 0.45287 | 0.55178
20 | 725 | 9 | 0.70552 | 0.45327 | 0.55193
20 | 750 | 9 | 0.70491 | 0.45337 | 0.55182
20 | 775 | 9 | 0.70442 | 0.45327 | 0.55159
25 | 675 | 9 | 0.69706 | 0.45518 | 0.55072
25 | 700 | 9 | 0.69699 | 0.45548 | 0.55092
25 | 725 | 9 | 0.69685 | 0.45588 | 0.55116
25 | 750 | 9 | 0.69626 | 0.45598 | 0.55106
25 | 775 | 9 | 0.69579 | 0.45588 | 0.55084
30 | 675 | 9 | 0.69268 | 0.45648 | 0.55029
30 | 700 | 9 | 0.69261 | 0.45678 | 0.55050
30 | 725 | 9 | 0.69248 | 0.45718 | 0.55074
30 | 750 | 9 | 0.69179 | 0.45728 | 0.55060
30 | 775 | 9 | 0.69143 | 0.45718 | 0.55041
Values in bold indicate the parameters that resulted in the best F1 Score for each edit distance. The row highlighted in cyan represents the highest F1 Score obtained across the entire species.
Table A6. F1 score: Canis familiaris (dog), k-mer size: 100, total num. of probes: 10.
Edit Distance | Threshold | Minimum Probes | Precision | Recall | F1 Score
5 | 775 | 2 | 0.65663 | 0.32805 | 0.43751
5 | 800 | 2 | 0.65683 | 0.32815 | 0.43764
5 | 825 | 2 | 0.65690 | 0.32825 | 0.43775
5 | 850 | 2 | 0.65683 | 0.32815 | 0.43764
5 | 875 | 2 | 0.65683 | 0.32815 | 0.43764
10 | 775 | 2 | 0.58100 | 0.44224 | 0.50220
10 | 800 | 2 | 0.58113 | 0.44235 | 0.50232
10 | 825 | 2 | 0.58119 | 0.44245 | 0.50241
10 | 850 | 2 | 0.58136 | 0.44235 | 0.50240
10 | 875 | 2 | 0.58110 | 0.44214 | 0.50217
15 | 700 | 6 | 0.68682 | 0.44285 | 0.53848
15 | 725 | 6 | 0.68665 | 0.44315 | 0.53865
15 | 750 | 6 | 0.68649 | 0.44325 | 0.53868
15 | 775 | 6 | 0.68627 | 0.44325 | 0.53860
15 | 800 | 6 | 0.68621 | 0.44335 | 0.53866
20 | 675 | 7 | 0.70457 | 0.45287 | 0.55134
20 | 700 | 7 | 0.70444 | 0.45307 | 0.55145
20 | 725 | 7 | 0.70385 | 0.45347 | 0.55156
20 | 750 | 7 | 0.70357 | 0.45357 | 0.55155
20 | 775 | 7 | 0.70293 | 0.45337 | 0.55121
25 | 675 | 7 | 0.69581 | 0.45959 | 0.55354
25 | 700 | 7 | 0.69558 | 0.45979 | 0.55362
25 | 725 | 7 | 0.69503 | 0.46019 | 0.55373
25 | 750 | 7 | 0.69476 | 0.46029 | 0.55372
25 | 775 | 7 | 0.69425 | 0.46009 | 0.55341
30 | 600 | 8 | 0.73720 | 0.44465 | 0.55471
30 | 625 | 8 | 0.73692 | 0.44515 | 0.55501
30 | 650 | 8 | 0.73653 | 0.44565 | 0.55529
30 | 675 | 8 | 0.73580 | 0.44565 | 0.55508
30 | 700 | 8 | 0.73507 | 0.44565 | 0.55488
35 | 600 | 8 | 0.73061 | 0.44786 | 0.55531
35 | 625 | 8 | 0.73031 | 0.44826 | 0.55552
35 | 650 | 8 | 0.72994 | 0.44876 | 0.55580
35 | 675 | 8 | 0.72934 | 0.44876 | 0.55563
35 | 700 | 8 | 0.72863 | 0.44876 | 0.55542
40 | 600 | 8 | 0.72610 | 0.44946 | 0.55522
40 | 625 | 8 | 0.72605 | 0.44986 | 0.55551
40 | 650 | 8 | 0.72561 | 0.45047 | 0.55585
40 | 675 | 8 | 0.72502 | 0.45047 | 0.55567
40 | 700 | 8 | 0.72437 | 0.45057 | 0.55556
45 | 600 | 8 | 0.72076 | 0.45057 | 0.55449
45 | 625 | 8 | 0.72071 | 0.45097 | 0.55478
45 | 650 | 8 | 0.72029 | 0.45157 | 0.55511
45 | 675 | 8 | 0.71971 | 0.45157 | 0.55493
45 | 700 | 8 | 0.71930 | 0.45167 | 0.55489
50 | 600 | 8 | 0.71671 | 0.45227 | 0.55457
50 | 625 | 8 | 0.71666 | 0.45267 | 0.55486
50 | 650 | 8 | 0.71614 | 0.45327 | 0.55515
50 | 675 | 8 | 0.71557 | 0.45327 | 0.55498
50 | 700 | 8 | 0.71516 | 0.45337 | 0.55493
Values in bold indicate the parameters that resulted in the best F1 Score for each edit distance. The row highlighted in cyan represents the highest F1 Score obtained across the entire species.

Appendix A.3. Detailed L1PD Results for Equus caballus (horse)

Table A7. F1 score: Equus caballus (horse), k-mer size: 50, total num. of probes: 35.
Edit Distance | Threshold | Minimum Probes | Precision | Recall | F1 Score
5 | 1325 | 2 | 0.30146 | 0.36232 | 0.32908
5 | 1350 | 2 | 0.30146 | 0.36232 | 0.32908
5 | 1375 | 2 | 0.30170 | 0.36261 | 0.32934
5 | 1400 | 2 | 0.30166 | 0.36261 | 0.32932
5 | 1425 | 2 | 0.30166 | 0.36261 | 0.32932
10 | 1200 | 8 | 0.35367 | 0.40043 | 0.37560
10 | 1225 | 8 | 0.35396 | 0.40087 | 0.37595
10 | 1250 | 8 | 0.35405 | 0.40101 | 0.37606
10 | 1275 | 8 | 0.35400 | 0.40101 | 0.37603
10 | 1300 | 8 | 0.35396 | 0.40101 | 0.37601
15 | 1325 | 10 | 0.36712 | 0.37658 | 0.37178
15 | 1350 | 10 | 0.36676 | 0.37658 | 0.37159
15 | 1375 | 10 | 0.36704 | 0.37687 | 0.37188
15 | 1400 | 10 | 0.36694 | 0.37687 | 0.37182
15 | 1425 | 10 | 0.36689 | 0.37687 | 0.37180
20 | 1200 | 10 | 0.36080 | 0.37890 | 0.36962
20 | 1225 | 10 | 0.36124 | 0.37963 | 0.37019
20 | 1250 | 10 | 0.36123 | 0.37978 | 0.37026
20 | 1275 | 10 | 0.36108 | 0.37978 | 0.37019
20 | 1300 | 10 | 0.36098 | 0.37978 | 0.37013
25 | 1175 | 10 | 0.35640 | 0.37978 | 0.36770
25 | 1200 | 10 | 0.35657 | 0.38007 | 0.36794
25 | 1225 | 10 | 0.35706 | 0.38080 | 0.36853
25 | 1250 | 10 | 0.35690 | 0.38094 | 0.36852
25 | 1275 | 10 | 0.35656 | 0.38094 | 0.36833
Values in bold indicate the parameters that resulted in the best F1 Score for each edit distance. The row highlighted in cyan represents the highest F1 Score obtained across the entire species.
Table A8. F1 score: Equus caballus (horse), k-mer size: 75, total num. of probes: 13.
Edit Distance | Threshold | Minimum Probes | Precision | Recall | F1 Score
5 | 1525 | 2 | 0.43180 | 0.13032 | 0.20020
5 | 1550 | 2 | 0.43180 | 0.13032 | 0.20020
5 | 1575 | 2 | 0.43180 | 0.13032 | 0.20020
5 | 1600 | 2 | 0.43132 | 0.13018 | 0.19998
5 | 1625 | 2 | 0.43132 | 0.13018 | 0.19998
10 | 1475 | 2 | 0.32311 | 0.38341 | 0.35067
10 | 1500 | 2 | 0.32324 | 0.38356 | 0.35082
10 | 1525 | 2 | 0.32324 | 0.38356 | 0.35082
10 | 1550 | 2 | 0.32312 | 0.38356 | 0.35075
10 | 1575 | 2 | 0.32316 | 0.38356 | 0.35077
15 | 1500 | 5 | 0.36774 | 0.38909 | 0.37810
15 | 1525 | 5 | 0.36774 | 0.38909 | 0.37810
15 | 1550 | 5 | 0.36774 | 0.38909 | 0.37810
15 | 1575 | 5 | 0.36769 | 0.38909 | 0.37807
15 | 1600 | 5 | 0.36764 | 0.38909 | 0.37806
20 | 1500 | 6 | 0.38220 | 0.38603 | 0.38410
20 | 1525 | 6 | 0.38220 | 0.38603 | 0.38410
20 | 1550 | 6 | 0.38220 | 0.38603 | 0.38410
20 | 1575 | 6 | 0.38214 | 0.38603 | 0.38406
20 | 1600 | 6 | 0.38209 | 0.38603 | 0.38404
25 | 1500 | 6 | 0.37559 | 0.39345 | 0.38431
25 | 1525 | 6 | 0.37559 | 0.39345 | 0.38431
25 | 1550 | 6 | 0.37559 | 0.39345 | 0.38431
25 | 1575 | 6 | 0.37548 | 0.39345 | 0.38424
25 | 1600 | 6 | 0.37543 | 0.39345 | 0.38422
30 | 1500 | 6 | 0.37046 | 0.39476 | 0.38221
30 | 1525 | 6 | 0.37046 | 0.39476 | 0.38221
30 | 1550 | 6 | 0.37046 | 0.39476 | 0.38221
30 | 1575 | 6 | 0.37030 | 0.39476 | 0.38212
30 | 1600 | 6 | 0.37030 | 0.39476 | 0.38212
Values in bold indicate the parameters that resulted in the best F1 Score for each edit distance. The row highlighted in cyan represents the highest F1 Score obtained across the entire species.
Table A9. F1 score: Equus caballus (horse), k-mer size: 100, total num. of probes: 9.
Edit Distance | Threshold | Minimum Probes | Precision | Recall | F1 Score
5 | 1525 | 2 | 0.48668 | 0.11694 | 0.18856
5 | 1550 | 2 | 0.48668 | 0.11694 | 0.18856
5 | 1575 | 2 | 0.48668 | 0.11694 | 0.18856
5 | 1600 | 2 | 0.48607 | 0.11680 | 0.18833
5 | 1625 | 2 | 0.48607 | 0.11680 | 0.18833
10 | 1050 | 2 | 0.45106 | 0.19040 | 0.26776
10 | 1075 | 2 | 0.45141 | 0.19054 | 0.26796
10 | 1100 | 2 | 0.45175 | 0.19069 | 0.26816
10 | 1125 | 2 | 0.45160 | 0.19069 | 0.26814
10 | 1150 | 2 | 0.45160 | 0.19069 | 0.26814
15 | 1200 | 2 | 0.36113 | 0.37730 | 0.36902
15 | 1225 | 2 | 0.36154 | 0.37774 | 0.36945
15 | 1250 | 2 | 0.36163 | 0.37789 | 0.36957
15 | 1275 | 2 | 0.36153 | 0.37789 | 0.36951
15 | 1300 | 2 | 0.36148 | 0.37789 | 0.36949
20 | 900 | 4 | 0.38240 | 0.39825 | 0.39016
20 | 925 | 4 | 0.38254 | 0.39840 | 0.39029
20 | 950 | 4 | 0.38282 | 0.39869 | 0.39059
20 | 975 | 4 | 0.38276 | 0.39869 | 0.39055
20 | 1000 | 4 | 0.38276 | 0.39869 | 0.39055
25 | 1225 | 5 | 0.39538 | 0.39418 | 0.39477
25 | 1250 | 5 | 0.39553 | 0.39432 | 0.39492
25 | 1275 | 5 | 0.39553 | 0.39432 | 0.39492
25 | 1300 | 5 | 0.39547 | 0.39432 | 0.39488
25 | 1325 | 5 | 0.39542 | 0.39432 | 0.39486
30 | 1450 | 5 | 0.38871 | 0.40392 | 0.39616
30 | 1475 | 5 | 0.38880 | 0.40407 | 0.39628
30 | 1500 | 5 | 0.38888 | 0.40421 | 0.39638
30 | 1525 | 5 | 0.38883 | 0.40421 | 0.39636
30 | 1550 | 5 | 0.38883 | 0.40421 | 0.39636
35 | 1200 | 5 | 0.38282 | 0.40654 | 0.39431
35 | 1225 | 5 | 0.38318 | 0.40698 | 0.39471
35 | 1250 | 5 | 0.38331 | 0.40712 | 0.39484
35 | 1275 | 5 | 0.38326 | 0.40712 | 0.39482
35 | 1300 | 5 | 0.38326 | 0.40712 | 0.39482
40 | 1200 | 5 | 0.37951 | 0.40843 | 0.39343
40 | 1225 | 5 | 0.37986 | 0.40887 | 0.39382
40 | 1250 | 5 | 0.38000 | 0.40901 | 0.39396
40 | 1275 | 5 | 0.37994 | 0.40901 | 0.39392
40 | 1300 | 5 | 0.37994 | 0.40901 | 0.39392
45 | 1200 | 5 | 0.37405 | 0.41018 | 0.39127
45 | 1225 | 5 | 0.37440 | 0.41061 | 0.39166
45 | 1250 | 5 | 0.37443 | 0.41076 | 0.39175
45 | 1275 | 5 | 0.37433 | 0.41076 | 0.39168
45 | 1300 | 5 | 0.37423 | 0.41076 | 0.39163
50 | 1325 | 5 | 0.37091 | 0.41149 | 0.39014
50 | 1350 | 5 | 0.37095 | 0.41163 | 0.39022
50 | 1375 | 5 | 0.37084 | 0.41178 | 0.39022
50 | 1400 | 5 | 0.37059 | 0.41178 | 0.39009
50 | 1425 | 5 | 0.37054 | 0.41178 | 0.39007
Values in bold indicate the parameters that resulted in the best F1 Score for each edit distance. The row highlighted in cyan represents the highest F1 Score obtained across the entire species.

Appendix A.4. Detailed L1PD Results for Bos taurus (cow)

Table A10. F1 score: Bos taurus (cow), k-mer size: 50, total num. of probes: 5.
Edit Distance | Threshold | Minimum Probes | Precision | Recall | F1 Score
5 | 850 | 4 | 0.60057 | 0.62280 | 0.61147
5 | 875 | 4 | 0.60031 | 0.62413 | 0.61198
5 | 900 | 4 | 0.60063 | 0.62545 | 0.61278
5 | 925 | 4 | 0.59987 | 0.62545 | 0.61238
5 | 950 | 4 | 0.59924 | 0.62578 | 0.61221
10 | 850 | 5 | 0.87828 | 0.52000 | 0.65323
10 | 875 | 5 | 0.87652 | 0.52099 | 0.65352
10 | 900 | 5 | 0.87666 | 0.52165 | 0.65408
10 | 925 | 5 | 0.87513 | 0.52132 | 0.65339
10 | 950 | 5 | 0.87341 | 0.52231 | 0.65369
15 | 850 | 5 | 0.87588 | 0.53421 | 0.66364
15 | 875 | 5 | 0.87419 | 0.53520 | 0.66392
15 | 900 | 5 | 0.87432 | 0.53586 | 0.66446
15 | 925 | 5 | 0.87291 | 0.53586 | 0.66406
15 | 950 | 5 | 0.87070 | 0.53652 | 0.66392
20 | 850 | 5 | 0.86865 | 0.53785 | 0.66434
20 | 875 | 5 | 0.86609 | 0.53884 | 0.66434
20 | 900 | 5 | 0.86578 | 0.53950 | 0.66475
20 | 925 | 5 | 0.86440 | 0.53950 | 0.66434
20 | 950 | 5 | 0.86226 | 0.54016 | 0.66421
25 | 850 | 5 | 0.87553 | 0.53950 | 0.66761
25 | 875 | 5 | 0.87293 | 0.54049 | 0.66760
25 | 900 | 5 | 0.87260 | 0.54115 | 0.66801
25 | 925 | 5 | 0.87120 | 0.54115 | 0.66760
25 | 950 | 5 | 0.86903 | 0.54181 | 0.66746
Values in bold indicate the parameters that resulted in the best F1 Score for each edit distance. The row highlighted in cyan represents the highest F1 Score obtained across the entire species.

References

  1. Belancio, V.P.; Deininger, P.L.; Roy-Engel, A.M. LINE dancing in the human genome: Transposable elements and disease. Genome Med. 2009, 1, 97. [Google Scholar] [CrossRef]
  2. Solyom, S.; Kazazian, H.H. Mobile elements in the human genome: Implications for disease. Genome Med. 2012, 4, 12. [Google Scholar] [CrossRef]
  3. Zhang, X.; Zhang, R.; Yu, J. New Understanding of the Relevant Role of LINE-1 Retrotransposition in Human Disease and Immune Modulation. Front. Cell Dev. Biol. 2020, 8, 657. [Google Scholar] [CrossRef]
  4. Kazazian, H.H., Jr.; Moran, J.V. The impact of L1 retrotransposons on the human genome. Nat. Genet. 1998, 19, 19–24. [Google Scholar] [CrossRef]
  5. Hancks, D.C.; Kazazian, H.H., Jr. Roles for retrotransposon insertions in human disease. Mob. DNA 2016, 7, 9. [Google Scholar] [CrossRef]
  6. Rodić, N.; Sharma, R.; Sharma, R.; Zampella, J.; Dai, L.; Taylor, M.S.; Hruban, R.H.; Iacobuzio-Donahue, C.A.; Maitra, A.; Torbenson, M.S.; et al. Long Interspersed Element-1 Protein Expression Is a Hallmark of Many Human Cancers. Am. J. Pathol. 2014, 184, 1280–1286. [Google Scholar] [CrossRef]
  7. Lu, X.-J.; Xue, H.-Y.; Qi, X.; Jiang, X.; Ma, S.J. LINE-1 in cancer: Multifaceted functions and potential clinical implications. Genet. Med. 2016, 18, 431–439. [Google Scholar] [CrossRef]
  8. Rodić, N. LINE-1 activity and regulation in cancer. Front. Biosci.-Landmark 2018, 23, 1680–1686. [Google Scholar] [CrossRef]
  9. Ardeljan, D.; Taylor, M.S.; Ting, D.T.; Burns, K.H. The Human Long Interspersed Element-1 Retrotransposon: An Emerging Biomarker of Neoplasia. Clin. Chem. 2017, 63, 816–822. [Google Scholar] [CrossRef]
  10. Sato, S.; Gillette, M.; de Santiago, P.R.; Kuhn, E.; Burgess, M.; Doucette, K.; Feng, Y.; Mendez-Dorantes, C.; Ippoliti, P.J.; Hobday, S.; et al. LINE-1 ORF1p as a candidate biomarker in high grade serous ovarian carcinoma. Sci. Rep. 2023, 13, 1537. [Google Scholar] [CrossRef] [PubMed]
  11. Taylor, M.S.; Wu, C.; Fridy, P.C.; Zhang, S.J.; Senussi, Y.; Wolters, J.C.; Cajuso, T.; Cheng, W.C.; Heaps, J.D.; Miller, B.D.; et al. Ultrasensitive Detection of Circulating LINE-1 ORF1p as a Specific Multicancer Biomarker. Cancer Discov. 2023, 13, 2532–2547. [Google Scholar] [CrossRef]
  12. Kou, Y.; Wang, S.; Ma, Y.; Zhang, N.; Zhang, Z.; Liu, Q.; Mao, Y.; Zhou, R.; Yi, D.; Ma, L.; et al. A High Throughput Cell-Based Screen Assay for LINE-1 ORF1p Expression Inhibitors Using the In-Cell Western Technique. Front. Pharmacol. 2022, 13, 881938. [Google Scholar] [CrossRef]
  13. Song, C.; Li, J.; Liu, S.; Hou, H.; Zhu, T.; Chen, J.; Liu, L.; Jia, Y.; Xiong, W. An L1 retrotransposon insertion–induced deafness mouse model for studying the development and function of the cochlear stria vascularis. Proc. Natl. Acad. Sci. USA 2021, 118, e2107933118. [Google Scholar] [CrossRef]
  14. Tao, J.; Wang, Q.; Mendez-Dorantes, C.; Burns, K.H.; Chiarle, R. Frequency and mechanisms of LINE-1 retrotransposon insertions at CRISPR/Cas9 sites. Nat. Commun. 2022, 13, 3685. [Google Scholar] [CrossRef]
  15. Takahashi, T.; Stoiljkovic, M.; Song, E.; Gao, X.B.; Yasumoto, Y.; Kudo, E.; Carvalho, F.; Kong, Y.; Park, A.; Shanabrough, M.; et al. LINE-1 activation in the cerebellum drives ataxia. Neuron 2022, 110, 3278–3287.e8. [Google Scholar] [CrossRef]
  16. Lou, C.; Qiang, R.; Wu, H.; Zhang, L.; Li, W.; Jia, T.; Liu, X. Expression of LINE-1 retrotransposon in early human spontaneous abortion tissues. Medicine 2022, 101, e31964. [Google Scholar] [CrossRef]
  17. López, J.O.; Seguel, J.; Chamorro, A.; Ramos, K.S. Pattern matching for high precision detection of LINE-1s in human genomes. BMC Bioinform. 2022, 23, 375. [Google Scholar] [CrossRef]
  18. Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv 2013. [Google Scholar] [CrossRef]
  19. Langmead, B.; Salzberg, S.L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 2012, 9, 357–359. [Google Scholar] [CrossRef]
  20. Liu, Y.; Schmidt, B. Long read alignment based on maximal exact match seeds. Bioinformatics 2012, 28, i318–i324. [Google Scholar] [CrossRef]
  21. Ahmed, N.; Bertels, K.; Al-Ars, Z. A comparison of seed-and-extend techniques in modern DNA read alignment algorithms. In Proceedings of the 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Shenzhen, China, 15–18 December 2016; pp. 1421–1428. [Google Scholar] [CrossRef]
  22. Penzkofer, T.; Jäger, M.; Figlerowicz, M.; Badge, R.; Mundlos, S.; Robinson, P.N.; Zemojtel, T. L1Base 2: More retrotransposition-active LINE-1s, more mammalian genomes. Nucleic Acids Res. 2016, 45, D68–D73. [Google Scholar] [CrossRef]
  23. Phan, V.; Gao, S.; Tran, Q.; Vo, N.S. How genome complexity can explain the difficulty of aligning reads to genomes. BMC Bioinform. 2015, 16, S3. [Google Scholar] [CrossRef]
  24. Sievers, F.; Wilm, A.; Dineen, D.; Gibson, T.J.; Karplus, K.; Li, W.; Lopez, R.; McWilliam, H.; Remmert, M.; Söding, J.; et al. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol. Syst. Biol. 2011, 7, 539. [Google Scholar] [CrossRef]
  25. Edgar, R.C. MUSCLE: Multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004, 32, 1792–1797. [Google Scholar] [CrossRef]
  26. Katoh, K.; Standley, D.M. MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability. Mol. Biol. Evol. 2013, 30, 772–780. [Google Scholar] [CrossRef]
  27. Notredame, C.; Higgins, D.G.; Heringa, J. T-Coffee: A Novel Method for Fast and Accurate Multiple Sequence Alignment. J. Mol. Biol. 2000, 302, 205–217. [Google Scholar] [CrossRef]
  28. Babaian, A.; Ebou, A.; Fegen, A.; Kam, H.Y.; Novakovsky, G.E.; Wong, J.; Aïssi, D.; Yao, L. bioSyntax: Syntax highlighting for computational biology. BMC Bioinform. 2018, 19, 303. [Google Scholar] [CrossRef]
  29. Cock, P.J.A.; Antao, T.; Chang, J.T.; Chapman, B.A.; Cox, C.J.; Dalke, A.; Friedberg, I.; Hamelryck, T.; Kauff, F.; Wilczynski, B.; et al. Biopython: Freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 2009, 25, 1422–1423. [Google Scholar] [CrossRef]
  30. Alkan, C.; Kidd, J.; Marques-Bonet, T.; Aksay, G.; Antonacci, F.; Hormozdiari, F.; Kitzman, J.O.; Baker, C.; Malig, M.; Mutlu, O.; et al. Personalized copy number and segmental duplication maps using next-generation sequencing. Nat. Genet. 2009, 41, 1061–1067. [Google Scholar] [CrossRef] [PubMed]
  31. van Rijsbergen, C.J. Evaluation. In Information Retrieval, 2nd ed.; Butterworth-Heinemann: Glasgow, Scotland, 1979; pp. 112–140. [Google Scholar]
Figure 1. Extracting L1 components into separate files.
Figure 2. Probe-generation process.
Figure 3. L1PD algorithm.
Figure 4. Best F1 scores for each edit distance and k-mer size in human genome.
Figure 5. Best F1 scores for each edit distance and k-mer size in dog genome.
Figure 6. Best F1 scores for each edit distance and k-mer size in horse genome.
Figure 7. L1PD mode flowchart.
Table 1. Best F1 score by edit distance and k-mer size: Homo sapiens (human).
Edit Distance | k-mer Size | Threshold | Minimum Probes | Precision | Recall | F1 Score
5 | 50 | 675 | 24 | 0.79044 | 0.50574 | 0.61682
5 | 75 | 700 | 12 | 0.80344 | 0.44042 | 0.56895
5 | 100 | 575 | 9 | 0.95448 | 0.27152 | 0.42277
10 | 50 | 650 | 24 | 0.76713 | 0.59278 | 0.66877
10 | 75 | 625 | 14 | 0.79380 | 0.59249 | 0.67852
10 | 100 | 625 | 7 | 0.76994 | 0.54348 | 0.63718
15 | 50 | 650 | 24 | 0.76265 | 0.59842 | 0.67062
15 | 75 | 625 | 17 | 0.89338 | 0.57801 | 0.70189
15 | 100 | 625 | 9 | 0.89072 | 0.58613 | 0.70701
20 | 50 | 675 | 25 | 0.77723 | 0.58876 | 0.66999
20 | 75 | 625 | 17 | 0.88651 | 0.59366 | 0.71111
20 | 100 | 625 | 9 | 0.88124 | 0.61173 | 0.72215
25 | 50 | 675 | 25 | 0.77637 | 0.58942 | 0.67009
25 | 75 | 625 | 17 | 0.88086 | 0.59871 | 0.71287
25 | 100 | 600 | 9 | 0.87538 | 0.62226 | 0.72742
30 | 50 | 675 | 25 | 0.77682 | 0.58942 | 0.67026
30 | 75 | 625 | 17 | 0.87747 | 0.60083 | 0.71326
30 | 100 | 600 | 9 | 0.86826 | 0.62870 | 0.72931
Values in bold indicate the parameters that resulted in the best F1 Score for each edit distance. The row highlighted in cyan represents the highest F1 Score obtained across the entire species.
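The comparison across k-mer sizes in Table 1 can be summarized programmatically. The sketch below uses the (edit distance, k-mer size, F1 score) values transcribed from Table 1 and picks the best F1 score for each k-mer size; the helper name `best_by_kmer` is an illustrative choice and is not part of L1PD.

```python
# (edit distance, k-mer size, F1 score) transcribed from Table 1 (human).
TABLE_1 = [
    (5, 50, 0.61682), (5, 75, 0.56895), (5, 100, 0.42277),
    (10, 50, 0.66877), (10, 75, 0.67852), (10, 100, 0.63718),
    (15, 50, 0.67062), (15, 75, 0.70189), (15, 100, 0.70701),
    (20, 50, 0.66999), (20, 75, 0.71111), (20, 100, 0.72215),
    (25, 50, 0.67009), (25, 75, 0.71287), (25, 100, 0.72742),
    (30, 50, 0.67026), (30, 75, 0.71326), (30, 100, 0.72931),
]


def best_by_kmer(rows):
    """Map each k-mer size to the (edit distance, F1) pair with the highest F1."""
    best = {}
    for edit_distance, kmer_size, f1 in rows:
        if kmer_size not in best or f1 > best[kmer_size][1]:
            best[kmer_size] = (edit_distance, f1)
    return best


if __name__ == "__main__":
    for kmer_size, (edit_distance, f1) in sorted(best_by_kmer(TABLE_1).items()):
        print(f"k={kmer_size}: best F1 {f1:.5f} at edit distance {edit_distance}")
```

With these values, the best F1 score increases with k-mer size, and the longer probes reach their best score at a larger edit distance than the 50-mers do.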
Table 2. Time required by L1PD with different k-mer sizes with default values: Homo sapiens (human).
k-mer Size | Edit Distance | Distance Threshold | Minimum k-mers Required | Total Amount of k-mers | Time
50 | 15 | 625 | 18 | 34 | 44 min, 33 s
75 | 30 | 625 | 17 | 22 | 44 min, 28 s
100 | 30 | 600 | 9 | 13 | 44 min, 29 s
Table 3. Best F1 score by edit distance and k-mer size: Canis familiaris (dog).
Edit Distance | k-mer Size | Threshold | Minimum Probes | Precision | Recall | F1 Score
5 | 50 | 750 | 5 | 0.58541 | 0.45217 | 0.51022
5 | 75 | 900 | 2 | 0.57946 | 0.39582 | 0.47038
5 | 100 | 825 | 2 | 0.65690 | 0.32825 | 0.43775
10 | 50 | 725 | 10 | 0.65220 | 0.45217 | 0.53406
10 | 75 | 750 | 6 | 0.65151 | 0.44686 | 0.53011
10 | 100 | 825 | 2 | 0.58119 | 0.44245 | 0.50241
15 | 50 | 750 | 10 | 0.64183 | 0.45618 | 0.53331
15 | 75 | 725 | 9 | 0.71589 | 0.44565 | 0.54933
15 | 100 | 750 | 6 | 0.68649 | 0.44325 | 0.53868
20 | 50 | 750 | 9 | 0.63331 | 0.45748 | 0.53122
20 | 75 | 725 | 9 | 0.70552 | 0.45327 | 0.55193
20 | 100 | 725 | 7 | 0.70385 | 0.45347 | 0.55156
25 | 50 | 725 | 11 | 0.65070 | 0.44826 | 0.53082
25 | 75 | 725 | 9 | 0.69685 | 0.45588 | 0.55116
25 | 100 | 725 | 7 | 0.69503 | 0.46019 | 0.55373
Values in bold indicate the parameters that resulted in the best F1 Score for each edit distance. The row highlighted in cyan represents the highest F1 Score obtained across the entire species.
Table 4. Best F1 score by edit distance and k-mer size: Equus caballus (horse).
Edit Distance | k-mer Size | Threshold | Minimum Probes | Precision | Recall | F1 Score
5 | 50 | 1375 | 2 | 0.32847 | 0.35432 | 0.34089
5 | 75 | 1575 | 2 | 0.43180 | 0.13032 | 0.20020
5 | 100 | 1575 | 2 | 0.48668 | 0.11694 | 0.18856
10 | 50 | 1375 | 7 | 0.35165 | 0.41498 | 0.38096
10 | 75 | 1525 | 2 | 0.32324 | 0.38356 | 0.35082
10 | 100 | 1575 | 2 | 0.45382 | 0.19156 | 0.26939
15 | 50 | 1650 | 8 | 0.34442 | 0.41396 | 0.37599
15 | 75 | 1550 | 5 | 0.36774 | 0.38909 | 0.37810
15 | 100 | 1425 | 2 | 0.36161 | 0.37818 | 0.36969
20 | 50 | 1650 | 9 | 0.36001 | 0.38865 | 0.37377
20 | 75 | 1550 | 6 | 0.38220 | 0.38603 | 0.38410
20 | 100 | 1475 | 4 | 0.38428 | 0.40116 | 0.39253
25 | 50 | 1650 | 9 | 0.35720 | 0.39025 | 0.37298
25 | 75 | 1550 | 6 | 0.37559 | 0.39345 | 0.38431
25 | 100 | 1350 | 5 | 0.39556 | 0.39447 | 0.39501
Values in bold indicate the parameters that resulted in the best F1 Score for each edit distance. The row highlighted in cyan represents the highest F1 Score obtained across the entire species.
Table 5. Best F1 score by edit distance for k-mer size 50: Bos taurus (cow).
Edit Distance | k-mer Size | Threshold | Minimum Probes | Precision | Recall | F1 Score
5 | 50 | 900 | 4 | 0.88203 | 0.48694 | 0.62747
10 | 50 | 900 | 4 | 0.86544 | 0.54644 | 0.66990
15 | 50 | 900 | 4 | 0.86354 | 0.56066 | 0.67989
20 | 50 | 900 | 4 | 0.85413 | 0.56330 | 0.67887
25 | 50 | 900 | 4 | 0.86045 | 0.56462 | 0.68182
The row highlighted in cyan represents the highest F1 Score obtained across the entire species.
Table 6. Time required by L1PD with different k-mer sizes in different genomes with default values.
Species | k-mer Size | Number of k-mers | Time Used by L1PD
Dog | 50 | 24 | 38 min, 8 s
Dog | 75 | 14 | 48 min, 24 s
Dog | 100 | 10 | 44 min, 4 s
Horse | 50 | 32 | 49 min, 27 s
Horse | 75 | 13 | 52 min, 25 s
Horse | 100 | 9 | 48 min, 32 s
Cow | 50 | 4 | 52 min, 24 s
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
