Transcriptome Annotation Pipeline for Non-Model Organisms


The introduction of high-throughput sequencing technologies has allowed researchers to generate large amounts of genomic data quickly and at limited cost. This has had a groundbreaking impact on the study of non-model organisms: above all, RNA-Seq and de novo transcriptome assembly are a valuable source of information for species in which genomic resources are scarce or absent. However, sequencing and assembly are only the first steps, and accurate annotation is fundamental to every kind of biological analysis. Annotating transcriptomes from model organisms and their closely related species is fairly straightforward and is generally based on simple sequence similarity searches. Conversely, non-model organisms require more complex and integrated procedures to infer remote homology and function. We developed an annotation pipeline designed specifically for transcriptomes of non-model organisms. It takes an integrated approach that combines different bioinformatics tools to obtain: 1) ORF prediction and identification of putative pseudogenes and artificially fused transcripts; 2) coding sequence annotation based both on sequence similarity and on the identification of conserved domains by protein signature recognition; 3) functional annotation of coding sequences by the assignment of GO terms; 4) identification of orthologs and paralogs; 5) annotation of noncoding transcripts. The pipeline can be run automatically in its entirety or by selecting only specific modules.
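The modular design described above, running either the whole pipeline or only selected modules, can be illustrated with a minimal sketch. This is not the pipeline's actual code: the module names below are hypothetical labels mirroring the five steps listed, and each step is a placeholder that only records that it ran. The sketch assumes that selected modules should still run in the canonical order.

```python
from typing import Callable, Dict, Iterable, List, Optional

# Hypothetical module names mirroring the five steps listed above;
# the real pipeline's module names and interfaces are not given in the text.
MODULE_ORDER: List[str] = [
    "orf_prediction",
    "coding_annotation",
    "go_assignment",
    "orthology",
    "noncoding_annotation",
]


def make_placeholder(name: str) -> Callable[[Dict[str, List[str]]], None]:
    """Return a stand-in step that only records that it ran."""
    def step(state: Dict[str, List[str]]) -> None:
        state["completed"].append(name)
    return step


# Map each module name to its (placeholder) implementation.
MODULES: Dict[str, Callable[[Dict[str, List[str]]], None]] = {
    name: make_placeholder(name) for name in MODULE_ORDER
}


def run_pipeline(selected: Optional[Iterable[str]] = None) -> List[str]:
    """Run every module, or only the selected ones, in canonical order."""
    wanted = set(MODULE_ORDER) if selected is None else set(selected)
    unknown = wanted - set(MODULE_ORDER)
    if unknown:
        raise ValueError(f"unknown modules: {sorted(unknown)}")
    state: Dict[str, List[str]] = {"completed": []}
    for name in MODULE_ORDER:
        if name in wanted:
            MODULES[name](state)
    return state["completed"]
```

Calling `run_pipeline()` with no arguments runs all five placeholder steps, while `run_pipeline(["orthology", "orf_prediction"])` runs only those two, ORF prediction first. Keeping a fixed order is one possible design choice, on the assumption that later steps depend on the output of earlier ones.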

We tested our pipeline by annotating RNA-Seq data from the Manila clam, Ruditapes philippinarum (Bivalvia, Veneridae), and the European clam, Ruditapes decussatus (Bivalvia, Veneridae). We are currently finalizing some steps to speed up the pipeline, and a manuscript is in preparation.

The poster presented at the SMBE 2016 Meeting is available for download from figshare.