INTRODUCTION
The “Central Dogma” of molecular biology states that genetic information typically flows from DNA to RNA to protein. Yet less than 3% of transcripts encode proteins, and ~80% of mammalian DNA is transcribed into non-coding RNAs (ncRNAs). Among these, long non-coding RNAs (lncRNAs) have emerged as pivotal regulators of gene expression. LncRNAs are typically defined as transcripts exceeding 200 nucleotides in length with low or no protein-coding potential. Their diverse functions, myriad isoforms and intricate relationships with other genes make their classification and annotation challenging.
Beyond their nuclear roles, lncRNAs also act in the cytoplasm, regulating processes such as translation, metabolism, and signaling. Their modular structure, enriched with repetitive elements, allows them to bind complementary RNA molecules and thereby modulate gene expression at the pre-transcriptional, post-transcriptional and translational levels.
Given their involvement in numerous biological processes and disease pathways, lncRNAs are emerging as promising therapeutic targets and diagnostic markers. Their intricate structures and sequences offer a multitude of druggable sites, providing unique binding regions for potential small molecules or peptide-based therapeutics. In drug discovery, lncRNAs can serve both as targets and as tools. Their ability to 'sponge' microRNAs, for instance, can be harnessed to design RNA-based drugs targeting specific disease-linked miRNAs. Their cell-type specificity further positions them as promising biomarkers for disease detection and treatment monitoring. As their functions continue to be uncovered, lncRNAs are likely to become increasingly important in medicinal chemistry, drug discovery, and molecular pharmacology, and may enable new treatments and diagnostic techniques.
The complexity of the lncRNA landscape is underscored by the limitations of traditional sequencing methods: short-read technologies often struggle to capture the full spectrum of lncRNA diversity and usually fail to recover these molecules at full length. Long-read sequencing technologies, such as Oxford Nanopore Technologies (ONT) and PacBio, are transforming our ability to detect elusive lncRNAs, particularly those with long transcripts or intricate secondary structures. Additionally, Nanopore sequencing allows results to be interpreted almost immediately, accelerating the discovery phase in research settings.
While the rat serves as a cornerstone model organism in scientific research due to its physiological and genetic similarities to humans, its transcriptome is surprisingly under-annotated compared to species involved in the ENCODE initiative, such as humans and mice. This disparity is particularly striking given the pivotal role of the rat in biomedical studies. Thousands of lncRNAs and transcript isoforms in the rat remain undiscovered or inadequately characterized. This gap not only hampers our understanding of rat biology but also limits the translational potential of rat-based studies to human health. The under-annotation underscores the urgent need for comprehensive transcriptomic studies in rats, leveraging advanced sequencing technologies and computational tools. By bridging this knowledge gap, we can unlock the full potential of the rat as a model system, ensuring that insights gleaned from it are both accurate and directly applicable to understanding human biology and disease.
In this study, in addition to integrating publicly available Nanopore RNA-seq data to identify lncRNAs in rats, we benchmark tools designed to assess the coding potential of transcripts. Evaluating coding potential is a pivotal step in transcriptomic projects, especially once a repertoire of candidate lncRNAs has been identified. Given the rapid advances in computational methods, tools based on artificial intelligence (AI) have come to the forefront of this domain. We specifically compare the performance of AI-based tools against traditional methods, aiming to determine the efficacy, accuracy, and reliability of AI-driven approaches in discerning the coding potential of transcripts. Such a comparison not only provides insight into the strengths and limitations of each tool but also guides researchers in selecting the most appropriate method for their own transcriptomic projects.
AIMS
General objective
Leverage public databases containing Nanopore sequencing transcriptomic data for rats to establish a robust in silico analysis pipeline for the identification and characterization of long non-coding RNAs.
Specific objectives
Assess and reconstruct the long non-coding transcriptome from a public dataset generated by Nanopore technology to uncover potential novel transcripts.
Benchmark tools to evaluate the coding potential of newly identified lncRNAs.
METHODS
Exploratory metadata analysis of public datasets
We conducted a targeted search within the SRA, GEO, and ENA databases, focusing on datasets related to rats, RNA-seq, and long-read sequences generated by ONT sequencers. Utilizing the data procured from the SRA Run Selector, we performed an in-depth metadata analysis using a custom Python script. The raw transcriptomic data for the chosen dataset was subsequently downloaded through the Linux terminal, employing the fastq-dl tool.
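The custom script itself is not reproduced here. As a minimal sketch of what such a metadata summary could look like, the example below counts runs per BioProject and per sequencing setup. It assumes a comma-separated metadata export whose columns are named `Run`, `BioProject`, `Instrument` and `LibrarySource`; the column names and the example rows are illustrative assumptions, not the study's actual data.

```python
import csv
import io
from collections import Counter, defaultdict


def summarize_runs(handle):
    """Count runs per BioProject and per (instrument, library source) pair."""
    per_project = Counter()
    per_setup = defaultdict(Counter)
    for row in csv.DictReader(handle):
        project = row["BioProject"]
        per_project[project] += 1
        per_setup[project][(row["Instrument"], row["LibrarySource"])] += 1
    return per_project, per_setup


# Hypothetical rows standing in for a real metadata export.
example = io.StringIO(
    "Run,BioProject,Instrument,LibrarySource\n"
    "RUN_A,PROJECT_1,MinION,TRANSCRIPTOMIC\n"
    "RUN_B,PROJECT_1,MinION,TRANSCRIPTOMIC\n"
    "RUN_C,PROJECT_2,MinION,TRANSCRIPTOMIC\n"
)
per_project, per_setup = summarize_runs(example)
for project, n_runs in sorted(per_project.items()):
    print(project, n_runs, dict(per_setup[project]))
```

Passing an open file handle instead of the in-memory example would apply the same summary to a downloaded metadata table.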
Quality control and data filtering
We employed FastQC and NanoPlot, including the NanoStats package, for initial data assessment. An aggregated overview was generated using MultiQC. The data were then filtered with NanoFilt, retaining reads with q ≥ 10 and a read length ≤ 15k bp. From the filtered data, the four samples with the longest reads, one from each group of the dataset's experimental design, were chosen.
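The filtering criteria can be illustrated with a short, self-contained sketch. It is not the filtering tool's own code: it assumes standard four-line FASTQ records with Phred+33 quality encoding, and it averages quality on the error-probability scale before converting back to Phred, which is a common convention for long reads. The function names and the toy records are hypothetical.

```python
import math


def mean_phred(quality_string, offset=33):
    """Average quality computed on error probabilities, then converted back to Phred."""
    probs = [10 ** (-(ord(ch) - offset) / 10) for ch in quality_string]
    return -10 * math.log10(sum(probs) / len(probs))


def keep_read(sequence, quality_string, min_quality=10, max_length=15_000):
    """Return True when a read passes both the quality and the length threshold."""
    if not sequence or len(sequence) > max_length:
        return False
    return mean_phred(quality_string) >= min_quality


def filter_fastq(lines, **thresholds):
    """Yield (header, sequence, quality) for passing reads; expects 4-line FASTQ records."""
    records = [line.rstrip("\n") for line in lines]
    for i in range(0, len(records) - 3, 4):
        header, sequence, _, quality = records[i:i + 4]
        if keep_read(sequence, quality, **thresholds):
            yield header, sequence, quality


# Toy records: '5' encodes Phred 20 and '#' encodes Phred 2 (offset 33).
toy = ["@good", "ACGT", "+", "5555", "@bad", "ACGT", "+", "####"]
print([header for header, _, _ in filter_fastq(toy)])
```

Averaging error probabilities rather than raw Phred scores prevents a few high-quality bases from masking a generally noisy read.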
Splice-aware mapping against the rat reference genome
Using minimap2 in splice-aware mode, the selected samples were individually mapped against the rat reference genome (mRatBN7.2/rn7), considering the gene structures from two reference databases: RefSeq and the UCSC Genome Browser. The alignments were then filtered with samtools by mapping quality, retaining those with q ≥ 40, which corresponds to a 99.99% probability that the mapping position is correct.
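The link between the quality threshold and the quoted accuracy follows from the Phred-scaled definition of mapping quality used in the SAM format. It is written out here for reference, with $P_{\text{wrong}}$ denoting the probability that the reported mapping position is wrong:

```latex
Q = -10 \log_{10} P_{\text{wrong}}
\quad\Longrightarrow\quad
P_{\text{wrong}} = 10^{-Q/10},
\qquad
Q = 40 \;\Rightarrow\; P_{\text{wrong}} = 10^{-4},
\quad 1 - P_{\text{wrong}} = 0.9999 = 99.99\%.
```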
Transcript reconstruction: de novo and guided assembling
In the reference-guided assembly strategy, an existing reference genome annotation serves as a guide for the assembly of RNA sequences. However, this strategy may fail to detect new transcripts or splice variants that have not yet been annotated in that genome. In the de novo assembly strategy, by contrast, RNA sequences are assembled without the guidance of a reference. The method overlaps sequences and groups them to compose longer transcript sequences. This approach is particularly useful when a high-quality reference genome is not available, or when the aim is to discover new transcripts or splice variants not contained in pre-existing annotations.
We carried out a comparative analysis of both methodologies using the bioinformatics tool StringTie, which provides an in-depth view of the transcriptome structure. The combined transcriptome of the 4 selected samples was assembled both under the guidance of the rat reference annotations from the UCSC and RefSeq databases and through the de novo assembly process.
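The exact command lines are not given in the text. As a hedged sketch of the mapping, filtering and assembly steps, the helper functions below only build argument lists and execute nothing. They rely on the minimap2 `-ax splice` preset, the samtools `view -b -q` filter, and the StringTie `-L` (long reads) and `-G` (guide annotation) options. All file names are placeholders, and the parameters actually used in the study may have differed.

```python
def minimap2_cmd(genome_fasta, reads_fastq):
    """Spliced long-read alignment; minimap2 writes SAM to standard output."""
    return ["minimap2", "-ax", "splice", genome_fasta, reads_fastq]


def samtools_filter_cmd(sam_path, bam_path, min_mapq=40):
    """Keep alignments with mapping quality >= min_mapq and write them as BAM."""
    return ["samtools", "view", "-b", "-q", str(min_mapq), "-o", bam_path, sam_path]


def stringtie_cmd(sorted_bam, out_gtf, guide_gtf=None):
    """Long-read assembly; passing guide_gtf switches to annotation-guided mode."""
    cmd = ["stringtie", "-L", "-o", out_gtf]
    if guide_gtf is not None:
        cmd += ["-G", guide_gtf]
    return cmd + [sorted_bam]


# Placeholder file names, for illustration only.
print(" ".join(minimap2_cmd("rn7.fa", "sample.fastq")))
print(" ".join(samtools_filter_cmd("sample.sam", "sample.q40.bam")))
print(" ".join(stringtie_cmd("sample.q40.sorted.bam", "guided.gtf", "rn7.refseq.gtf")))
print(" ".join(stringtie_cmd("sample.q40.sorted.bam", "denovo.gtf")))
```

In this sketch the only difference between the two assembly modes is whether a guide annotation is supplied; StringTie expects a coordinate-sorted BAM file in both cases.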
Characterization and identification of new lncRNAs
The assembled transcripts were compared with the rn7 reference annotation using GffCompare, which provides a contextualized and detailed view of transcript structure in the rat genome. This makes it possible to identify transcripts that agree with existing annotations, as well as those that may represent new transcripts or variants of known isoforms. Transcripts categorized as intergenic, intronic or antisense are of special interest, as they often derive from regions of the genome that do not encode proteins but may have regulatory functions or other significant biological activities. Transcripts without an assigned strand were removed, and the remaining transcripts proceeded to statistical analysis.
LncRNAs have been arbitrarily defined as non-coding transcripts longer than 200 nucleotides (nt), a threshold that excludes most infrastructural RNAs as well as other well-known short RNAs. Given the diversity of sizes, Mattick et al. (2023), in a consensus statement, propose that lncRNAs are transcripts longer than 500 nt, mostly generated by RNA Pol II, and classified as intergenic, antisense or intronic.
Given these definitions, we applied both the 200 nt and the 500 nt length filters to evaluate their impact on the identification of possible novel lncRNAs.
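The selection criteria described above (transcript class, assigned strand, minimum length) can be sketched as a small filter over a GffCompare-annotated GTF. The example assumes the GffCompare class codes `u` (intergenic), `i` (within a reference intron) and `x` (exonic overlap on the opposite strand), and takes transcript length as the sum of exon lengths. The function name and the toy records are hypothetical, and the study's own filtering code may have differed.

```python
import re
from collections import defaultdict

# GffCompare class codes: u = intergenic, i = within a reference intron,
# x = exonic overlap on the opposite strand (antisense).
CANDIDATE_CODES = {"u": "intergenic", "i": "intronic", "x": "antisense"}


def _attr(attributes, key):
    """Extract one quoted attribute value from a GTF attribute column."""
    match = re.search(rf'(?:^|;\s*){key} "([^"]*)"', attributes)
    return match.group(1) if match else None


def candidate_lncrnas(gtf_lines, min_length=200):
    """Return {transcript_id: (category, length)} for stranded candidates >= min_length nt."""
    category, length = {}, defaultdict(int)
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        feature, strand, attributes = fields[2], fields[6], fields[8]
        start, end = int(fields[3]), int(fields[4])
        tid = _attr(attributes, "transcript_id")
        if feature == "transcript":
            code = _attr(attributes, "class_code")
            if strand in ("+", "-") and code in CANDIDATE_CODES:
                category[tid] = CANDIDATE_CODES[code]
        elif feature == "exon":
            length[tid] += end - start + 1  # GTF coordinates are 1-based, inclusive
    return {t: (c, length[t]) for t, c in category.items() if length[t] >= min_length}


# Toy records: T1 is stranded and 602 nt long; T2 has no assigned strand.
toy_gtf = [
    'chr1\tStringTie\ttranscript\t100\t900\t.\t+\t.\ttranscript_id "T1"; class_code "u";',
    'chr1\tStringTie\texon\t100\t300\t.\t+\t.\ttranscript_id "T1";',
    'chr1\tStringTie\texon\t500\t900\t.\t+\t.\ttranscript_id "T1";',
    'chr1\tStringTie\ttranscript\t2000\t2150\t.\t.\t.\ttranscript_id "T2"; class_code "u";',
    'chr1\tStringTie\texon\t2000\t2150\t.\t.\t.\ttranscript_id "T2";',
]
for threshold in (200, 500):
    print(threshold, candidate_lncrnas(toy_gtf, min_length=threshold))
```

Running the filter with both thresholds on the same input shows directly how many candidates the stricter 500 nt definition removes.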
Benchmarking of AI tools for the coding potential inspection
AI tools for coding potential prediction were benchmarked because no tool is specific to rats and long-read sequences. Tests were therefore performed using the pre-trained models of RNAmining and LncADeep for the mouse (Mus musculus), the model organism phylogenetically closest to the rat.
RNAmining is based on the XGBoost (eXtreme Gradient Boosting) machine learning algorithm, which optimizes prediction accuracy by sequentially combining weak models (typically decision trees) into a strong ensemble model. Key features include handling of missing values, parallel processing for faster computation, and regularization to prevent overfitting.
LncADeep identifies lncRNAs from a de novo perspective by integrating intrinsic sequence features and homology features in a Deep Belief Network (DBN). A DBN is a generative probabilistic model comprising multiple layers of stochastic latent variables (hidden units), capable of representing complex data distributions. DBNs are trained layer by layer in an unsupervised manner and can then be fine-tuned using conventional backpropagation.
Known coding and lncRNA sequences from the rat reference genome (rn7/Ensembl) were tested in both tools to validate the algorithms’ precision and sensitivity to correctly predict rat lncRNA sequences.
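For reference, precision and sensitivity for the lncRNA class can be computed from paired true and predicted labels as in the sketch below. The label strings and the five-sequence example are hypothetical and serve only to illustrate the two definitions.

```python
def precision_and_sensitivity(true_labels, predicted_labels, positive="lncRNA"):
    """Precision = TP / (TP + FP); sensitivity (recall) = TP / (TP + FN)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else float("nan")
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    return precision, sensitivity


# Hypothetical labels: three true lncRNAs and two true coding sequences.
truth = ["lncRNA", "lncRNA", "lncRNA", "coding", "coding"]
calls = ["lncRNA", "lncRNA", "coding", "lncRNA", "coding"]
precision, sensitivity = precision_and_sensitivity(truth, calls)
print(f"precision={precision:.3f} sensitivity={sensitivity:.3f}")
```

Because known coding sequences far outnumber known lncRNAs in the reference set, precision for the lncRNA class is sensitive to even a small false-positive rate among coding sequences, so reporting both measures is informative.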
RESULTS AND DISCUSSION
Exploratory metadata analysis of public datasets
From the exploratory analysis of the metadata available on the SRA platform, we found that all runs were cDNA libraries sequenced on the MinION. The retrieved data belong to 4 BioProjects: PRJNA517125, PRJEB51442, PRJNA904815, and PRJNA910244. Owing to data availability and specificity, the datasets from PRJNA517125 and PRJNA904815 were chosen to develop an analysis pipeline to reconstruct mRNA and ncRNA transcripts, characterize splicing events and discover novel transcripts, including novel lncRNAs, and in the future to support novel strategies for drug targeting and disease treatment.
In BioProject PRJNA517125, heart tissue was sequenced from 4 different strains of female rats at the embryonic stage with RBFOX2 knockdown. In the taxonomy analysis of these samples, the highest proportion of identified reads was 82.76%. In BioProject PRJNA904815, 20 samples of hippocampal neurons from Wistar rats were sequenced: 5 controls and 15 treated samples split into equal groups by duration of picrotoxin exposure (30 min, 60 min and 5 h), of which 60% showed an activation response. The taxonomy analysis of these samples gave the best read identification among the projects, with a maximum of 99.25%.
Quality control and filtering
After a detailed analysis of the results for the cardiomyocyte dataset, we found that the samples contained no duplicate sequences. This is unexpected, since duplicates are common in cDNA-based RNA-seq data, and it was the main reason for removing this dataset from the analysis pipeline.
Thus, the hippocampus dataset was selected to develop the pipeline, and the samples with the longest reads in each group of the experimental design were chosen: control (SRR22399484), and exposure of hippocampal neurons to picrotoxin for 30 min (SRR22399494), 60 min (SRR22399493) and 5 h (SRR22399496).
Splice-aware mapping against the rat reference genome
Sequences were mapped against rn7 using both the RefSeq and UCSC references. Some sequences did not align with their respective references; the remainder were filtered by mapping quality (q ≥ 40) for both databases. After filtering, 1,845,369, 760,753, 1,179,053, and 1,181,535 sequences remained for the four selected samples, respectively, with RefSeq, and 1,845,394, 760,763, 1,179,089, and 1,181,523, respectively, with UCSC.
Transcript reconstruction: de novo and guided assembling by reference annotation
For the mapping against rn7/RefSeq, annotation-guided assembly with the same reference reconstructed 26,684 transcripts, whereas de novo assembly produced 23,288 transcripts. For the mapping against rn7/UCSC, annotation-guided assembly reconstructed 26,742 transcripts, whereas de novo assembly produced 23,205 transcripts.
Most reconstructed transcripts are multi-exonic with both methods, suggesting structural complexity with potential regulatory or coding roles. Multi-exonic transcripts are typically stable and less likely to be artifacts. In addition, many loci produce multiple transcripts, highlighting the prevalence of alternative splicing and of multiple isoforms arising from a single genetic locus, a known feature of eukaryotic genomes that enriches protein and functional diversity.
Characterization and identification of new lncRNAs
All transcripts selected as possible novel lncRNAs were ≥ 200 nt in length, meeting the classical minimum size for lncRNAs, and most were also longer than 500 nt.
For the rn7/RefSeq mapping with annotation-guided assembly, comparison with the reference gene annotation in .gff format identified 412 intergenic, 113 intronic, and 350 antisense transcripts. For the de novo assembly, there were 403 intergenic, 114 intronic, and 399 antisense transcripts. For the rn7/UCSC mapping with annotation-guided assembly, 519 intergenic, 140 intronic, and 353 antisense transcripts were found. With de novo assembly, there were 508 intergenic, 146 intronic, and 381 antisense transcripts.
Excluding unstranded transcripts minimizes potential noise; notably, all such transcripts were mono-exonic. Antisense transcripts inherently carry a strand orientation, since they are defined by overlapping a reference transcript on the opposite strand. Intergenic transcripts were the most abundant class in both analyses, followed by antisense and then intronic transcripts.
Benchmarking of AI tools for validation of coding potential
When the LncADeep mouse model was used to predict the coding potential of the 45,938 known rat coding sequences (CDS), the DBN-based algorithm correctly predicted ~94.7% of the sequences as coding and misclassified 2,444 (~5.3%) as lncRNA. For the 4,122 known lncRNA sequences, it correctly predicted ~96.9% as lncRNA and misclassified 128 (~3.1%) as CDS.
Tests with the RNAmining XGBoost algorithm were more promising: when the mouse model was tested against the known rat CDS, the tool correctly predicted ~99.43% as coding sequences and misclassified 263 (~0.57%) as ncRNA. For the known lncRNA sequences, it correctly predicted ~99.44% as lncRNA and misclassified 23 (~0.56%) as CDS. Accordingly, RNAmining was chosen as the best tool for validating the non-coding nature of the potential new lncRNAs.
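The reported percentages can be cross-checked from the reported counts alone. The short sketch below recomputes the misclassification rate and its complement for each tool and reference set. Only the counts stated in the text are used; the helper name is hypothetical.

```python
def misclassification_rate(n_wrong, n_total):
    """Percentage of sequences assigned to the wrong class."""
    return 100 * n_wrong / n_total


# (tool, reference set) -> (misclassified sequences, total sequences), as reported.
reported = {
    ("LncADeep", "CDS"): (2444, 45938),
    ("LncADeep", "lncRNA"): (128, 4122),
    ("RNAmining", "CDS"): (263, 45938),
    ("RNAmining", "lncRNA"): (23, 4122),
}
for (tool, reference_set), (n_wrong, n_total) in reported.items():
    wrong = misclassification_rate(n_wrong, n_total)
    print(f"{tool:9s} {reference_set:6s} misclassified {wrong:.2f}%  correct {100 - wrong:.2f}%")
```

Since each sequence is either correctly classified or misclassified, the two percentages for a given tool and reference set should sum to 100%.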
In the RNAmining coding potential predictions for the UCSC-based guided assembly, 503 of 519 intergenic transcripts, 138 of 140 intronic transcripts, and 347 of 353 antisense transcripts were classified as non-coding. With the 500 nt filter applied to the same guided assembly, 408 of 414 intergenic transcripts, all 102 intronic transcripts, and 299 of 303 antisense transcripts were non-coding. In the UCSC-based de novo assembly, 492 of 508 intergenic transcripts, 144 of 146 intronic transcripts, and 379 of 381 antisense transcripts were classified as non-coding. With the 500 nt filter applied to the de novo assembly, 398 of 404 intergenic, 105 of 106 intronic, and 335 of 336 antisense transcripts were non-coding.
For the RefSeq database results, in the guided assembly, 402 out of 412 intergenic transcripts, 112 out of 113 intronic transcripts, and 344 out of 350 antisense transcripts were found to be non-coding. In the 500nt version of the guided assembly, 313 out of 314 intergenic transcripts, all 81 intronic transcripts, and 294 out of 299 antisense transcripts were non-coding. For the de novo assembly of RefSeq, 393 out of 403 intergenic, all 114 intronic, and 393 out of 399 antisense transcripts were non-coding. Lastly, in the 500nt de novo assembly, 305 out of 306 intergenic, all 80 intronic, and 342 out of 347 antisense transcripts were identified as non-coding.
CONCLUSION
This study presented a detailed analysis of Nanopore RNA-seq data from the rat model, focusing on hippocampal neuron samples. A cornerstone of the study was transcript reconstruction, comparing annotation-guided and de novo strategies. Using current computational tools, we catalogued RNA sequences absent from existing annotations. By comparing these sequences against the reference annotation, we classified the transcripts, laying a foundation for deeper insight into gene expression differences, the identification of alternative isoforms, and gene regulation. This exploration revealed previously uncharted complexity within the known transcriptional landscape.
While the primary intent behind this study was to craft and standardize a robust analytical pipeline, it's impossible to overlook the broader implications, especially when considering the findings of Yao et al. (2021). The unearthing of potential new transcripts in hippocampal neurons invites a flurry of intriguing biological inquiries regarding their function and significance, which might play pivotal roles in modulating neuroplasticity, an elemental mechanism driving learning and memory processes. Delving deeper, these findings also hint at the vast potential of lncRNAs as key players in pharmacology and medicine. Recognized for their multifaceted roles in cellular biology, lncRNAs have piqued interest for their potential as both therapeutic targets and bioactive agents. As our understanding of the transcriptome continues to evolve, it underscores the need for further research to unravel the potential of these transcripts in influencing hippocampal neuroplasticity and their broader applicability in medical therapeutics.
ACKNOWLEDGMENTS
We are especially grateful for the financial support of Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq/Brasil – PVL15993-2022) and Agencia Nacional de Investigación y Desarrollo (ANID/Chile) – FONDECYT (1211731), FONDAP (15120011), STIC/AmSud (STIC2020008) and Anillo (ACT210004 and ATE220016).
REFERENCES
HUANG, S. et al. LncRNAs as Therapeutic Targets and Potential Biomarkers for Lipid-Related Diseases. Frontiers in Pharmacology, 4 Aug. 2021. v. 12.
MATTICK, J. S. et al. Long non-coding RNAs: definitions, functions, challenges and recommendations. Nature Reviews Molecular Cell Biology, 3 Jan. 2023. v. 24, p. 430–447.
RAGHAVAN, V. et al. A simple guide to de novo transcriptome assembly and annotation. Briefings in Bioinformatics, 24 Jan. 2022. v. 23, n. 2.
RAMOS, T. A. R. et al. RNAmining: A machine learning stand-alone and web server tool for RNA coding potential prediction. F1000Research, 8 Jun. 2021.
WINKLE, M. et al. Noncoding RNA therapeutics — challenges and potential solutions. Nature Reviews Drug Discovery, 18 Jun. 2021.
YANG, C. et al. LncADeep: an ab initio lncRNA identification and functional annotation tool based on deep learning. Bioinformatics (Oxford, England), 15 Nov. 2018. v. 34, n. 22, p. 3825–3834.
YAO, Z. et al. A taxonomy of transcriptomic cell types across the isocortex and hippocampal formation. Cell, Jun. 2021. v. 184, n. 12, p. 3222-3241.e26.