Genome Assembly & Annotation
De Novo Assembly
De novo genome sequencing and assembly is the method of choice for resolving the genetic makeup of an uncharacterized genome for which no prior reference or nucleotide sequence exists. With its high throughput, efficiency and speed, next-generation sequencing enables us to sequence a whole genome at high coverage. Sophisticated assembly algorithms are then applied to resolve the genomic sequence, which reveals gene structure and positioning.
Genome assembly can be classified into two types, each of which offers its own advantages:
• Reference-based assembly:
– Assess quality of sequencing (re-sequencing)
– Identify and annotate novel features, etc.
• De novo assembly:
– Generate a reference
– Identify novel features
– Annotate existing but un-annotated features
A typical genome assembly workflow is displayed; its steps make use of various bioinformatics tools and algorithms to generate the final genome assembly and annotation.
Many genome sequencing techniques are available, including:
– Short read next-generation sequencing: Illumina and Ion Torrent
– Long read next-generation sequencing: Pacific Biosciences and Oxford Nanopore
Each of these sequencing techniques has pros and cons for genome assembly. Short reads are high quality and cost effective and provide deep sequencing coverage; however, they tend to show coverage bias in regions of high AT or GC content. Most such regions are repeats and low-complexity regions. Short read lengths and biased coverage in repeat and low-complexity regions result in fragmented genome assemblies that provide a partial yet critical overview of the genetic makeup of an organism. Most short-read assemblers adopt de Bruijn graph based assembly. The figure below, adapted from Namiki et al., Nucleic Acids Research (2012), represents a typical de Bruijn graph assembly protocol.
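To make the de Bruijn idea concrete, here is a minimal Python sketch. It is an illustration only, not the protocol from the cited figure or the method of any particular assembler: the function names, the toy reads and the choice of k=4 are all assumptions, and it ignores sequencing errors, reverse complements and coverage. Each read is cut into k-mers, every k-mer becomes an edge between its (k-1)-mer prefix and suffix, and a contig is read off by following nodes that have exactly one successor.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the list of (k-1)-mers that follow it in some read."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Edge from the k-mer's prefix node to its suffix node.
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


def walk_unambiguous_path(graph, start):
    """Extend a contig from `start` while each node has one distinct successor."""
    contig = start
    node = start
    visited = {start}
    while True:
        successors = set(graph.get(node, []))
        if len(successors) != 1:
            break  # dead end or branch (for example, a repeat)
        node = successors.pop()
        if node in visited:
            break  # cycle; stop rather than loop forever
        visited.add(node)
        contig += node[-1]
    return contig


# Hypothetical toy reads that overlap along one short sequence.
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = build_de_bruijn(reads, k=4)
contig = walk_unambiguous_path(graph, "ATG")
print(contig)
```

In this toy case every node has a single successor, so the walk recovers one contig. In a real genome, repeats longer than k create branching nodes, and the walk stops there, which is one way to see why short-read assemblies fragment in repeat regions.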
Long reads have average read lengths of >10 kb but are of lower quality, with random errors. Long-read sequencing requires high-molecular-weight starting DNA, which at times requires expertise in sample extraction. In general, long-read assemblies have better contiguity, larger N50 values and higher genomic coverage than short-read assemblies. These long-read assemblies do, however, require polishing with short reads to correct random base-calling errors.
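Since N50 is used throughout as the contiguity measure, a short sketch of how it is commonly computed may help. N50 is the length of the contig at which, going from longest to shortest, the running total first reaches half of the total assembly size. The function name and the example lengths below are illustrative assumptions, not data from any assembly described here.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # Stop at the first contig that brings the running sum to >= 50% of the total.
        if running * 2 >= total:
            return length
    return 0


# Hypothetical contig lengths (in kb): total 290, half is 145.
lengths = [80, 70, 50, 40, 30, 20]
print(n50(lengths))  # 80 + 70 = 150 >= 145, so N50 is 70
```

A fragmented assembly of the same total size, made of many small contigs, would give a much smaller N50, which is why the value is a convenient single-number summary of contiguity.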
Long-read assemblers use the OLC (Overlap Layout Consensus) approach to assemble a genome. Here is a representation of PacBio's assembly process for bacterial genomes, called the hierarchical genome assembly process (HGAP).
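The overlap and layout steps of OLC can be sketched in a few lines of Python. This is a toy greedy merger under strong assumptions (error-free reads, exact suffix-prefix matches, a single strand), not HGAP or any production assembler, and it omits the consensus step that real long-read pipelines need to deal with sequencing errors. The function names, the reads and the `min_len` threshold are all hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` equal to a prefix of `b` (0 if < min_len)."""
    max_len = min(len(a), len(b))
    for n in range(max_len, min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest exact overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        # Overlap step: all-against-all suffix-prefix comparison.
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # no remaining overlaps; the rest stay as separate contigs
        # Layout step: join the best pair, keeping the shared bases once.
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Hypothetical error-free reads tiling one short sequence.
reads = ["ATGGCGTG", "GCGTGCAA", "TGCAATCC"]
print(greedy_assemble(reads))
```

The all-against-all comparison is quadratic in the number of reads, so real assemblers use indexing or sketching to find candidate overlaps. Longer reads help because they span repeats that would otherwise produce ambiguous overlaps.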
The table below compares Illumina and PacBio bacterial assemblies. In this comparison, long reads generate finished bacterial genomes that are ready to annotate.
Utturkar et al. A Case Study into Microbial Genome Assembly Gap Sequences and Finishing Strategies. Frontiers in Microbiology. 2017;8:1272. doi:10.3389/fmicb.2017.01272.
A number of recent studies have used PacBio long reads with various assemblers for genome assembly. Some of the key studies include:
|Organism||Common name||Technology||Assembly tool||Genome size||Contig N50 (Mb)||Scaffold N50 (Mb)|
|Taeniopygia guttata||Zebra finch||PB||FALCON||1.1 Gb||5.8||NA|
|Utricularia gibba||Carnivorous plant||PB||HGAP3||82 Mb||3.42||NA|
|Vitis vinifera||Grapevine||PB||FALCON||500 Mb||2.39||NA|
|Oreochromis niloticus||Nile tilapia||PB+RH+RAD map||Canu||815 Mb||3.1||NA|
|Lates calcarifer||Sea bass||PB+OM+LM||PB+OM+LM||700 Mb||1.72||25.85|
|Euclidium syriacum||Mustard family||PB+OM||FALCON||262 Mb||3.3||17.5|
PB, PacBio SMRT data; OM, Optical mapping data; LR, Linked reads; LM, Linkage maps.
More recently, Oxford Nanopore Technologies (ONT) sequencing has emerged as another long-read technology that is now actively used for genome assembly. ONT reads are similar to PacBio reads in average read length and have slightly higher error rates. Illumina sequencing reads are used to error-correct ONT reads and assemblies to improve the final base-call quality. Here are a few recent studies that used ONT for genome sequencing.
- Nanopore sequencing and assembly of a human genome with ultra-long reads. 29th Jan, 2018, Nature Biotech
- High contiguity Arabidopsis thaliana genome assembly with a single nanopore flow cell. 14th June 2017, Nature Communications
- Community-led comparative genomic and phenotypic analysis of the aquaculture pathogen Pseudomonas baetica a390T sequenced by Ion semiconductor and Nanopore technologies. 22 March 2018 – FEMS Microbiology Letters
- De novo whole-genome assembly of a wild type yeast isolate using nanopore sequencing. 3rd May 2017 – f1000
- De novo Assembly of a New Solanum pennellii Accession Using Nanopore Sequencing. 21st April 2017, Plant Cell
1010Genome has developed a robust pipeline and the scientific expertise to handle any single-platform or hybrid approach for de novo genome assembly.
Key Genome Assembly projects handled by us
|Organism||Genome Size||NGS Coverage||Assembly Size||# of Contigs||Contig N50|
|Bacteria||2Mb||Illumina (80x)||1.96Mb||8 Scaffolds||NA|
|Bacteria||4.3Mb||Illumina (100x)||3.9Mb||21 Scaffolds||NA|
|Bacteria||4.3Mb||PacBio (70x)||4.3Mb||Single closed plus plasmid||NA|
|Multi-nucleated Fungi||42Mb||PacBio (60x)||42.5Mb||298||419kb|
|Rice||400Mb||Illumina (80x) and PacBio (20x)||390Mb||3800||1.2Mb|
|Bean||440Mb||and PacBio (20x)||435Mb||3007||0.9Mb|
|Bacteria||2Mb||ONT (50x) Illumina (30x)||1.99Mb||Single closed plus plasmid||1.4Mb|
|Yeast||12.5Mb||ONT (60x) Illumina (30x)||13Mb||75||401kb|
Convinced that we can handle genome assemblies!!