MetAMOS directory structure¶
Directory layout/description¶
All of the steps mentioned below have the following directory structure:
[STEP]/in -> required input
[STEP]/out -> generated output
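As a minimal sketch of this layout, the helper below builds the `in` and `out` paths for a given step. The function name `step_dirs` and the run directory `my_run` are hypothetical and only serve the illustration; they are not part of MetAMOS.

```python
from pathlib import Path


def step_dirs(run_dir, step):
    """Return the (input, output) directories for one pipeline step."""
    base = Path(run_dir) / step
    return base / "in", base / "out"


# Hypothetical run directory; the step name comes from the list below.
in_dir, out_dir = step_dirs("my_run", "Preprocess")
print(in_dir)
print(out_dir)
```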
We will now describe in detail the functionality of each step, along with the expected input & output.
Preprocess¶
Required step?¶
- Yes
Software currently supported¶
ea-utils (code.google.com/p/ea-utils) - optional, off by default. Enabled by passing a trim option to runPipeline:
$ -t eautils
Assemblers that do not perform trimming can benefit from enabling this step. On a GAGE-B dataset, the assemblers which had a higher corrected N50 on trimmed data than untrimmed were:
IDBA-UD, SGA, SparseAssembler, SPAdes, Velvet-SC, and Velvet.
Assemblers which had a higher corrected N50 on untrimmed data were:
ABySS, MaSuRCA, MIRA, Ray, and SOAPdenovo2.
FastQC (bioinformatics.babraham.ac.uk) - optional, on by default for iMetAMOS, used to generate quality reports for the input sequencing data.
KmerGenie (Chikhi et al 2014) - optional, on by default for iMetAMOS, used to auto-select a k-mer for isolate genome assembly. Alternatively, a list of k-mers can be specified instead. For assemblers using a range of k-mers (e.g. IDBA-UD), KmerGenie is not used and the read length is specified as the maximum k-mer. For assemblers using a set of k-mers (e.g. SPAdes), the KmerGenie-selected k-mer is used along with a set of defaults.
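The k-mer selection rules described for KmerGenie can be sketched roughly as follows. This is an illustration under stated assumptions, not the MetAMOS implementation: the function `choose_kmers`, its `mode` argument, and the default k-mer values passed in the usage example are all hypothetical.

```python
def choose_kmers(mode, read_length, kmergenie_k=None, defaults=()):
    """Pick k-mer settings for an assembler.

    mode: "range" for assemblers that take a k-mer range (e.g. IDBA-UD),
          "set" for assemblers that take a list of k-mers (e.g. SPAdes).
    """
    if mode == "range":
        # KmerGenie is not used; the read length is the maximum k-mer.
        return {"max_k": read_length}
    if mode == "set":
        if kmergenie_k is None:
            raise ValueError("a KmerGenie-selected k-mer is required in 'set' mode")
        # Combine the selected k-mer with the defaults, without duplicates.
        return {"kmers": sorted(set(defaults) | {kmergenie_k})}
    raise ValueError("unknown mode: %r" % mode)


# Placeholder values: 47 stands in for a KmerGenie result, and the
# defaults tuple is an assumption made for this example only.
print(choose_kmers("range", read_length=100))
print(choose_kmers("set", read_length=100, kmergenie_k=47, defaults=(21, 33, 55)))
```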
What it does¶
- Quality control
- Read filtering
- Read trimming
- Sanity checks on fasta/q files
- Conversion to required formats
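The exact sanity checks are not listed here. As a minimal sketch of what such a check on a FASTQ file might look like, the function below assumes four-line records with no blank lines and reports three common problems. The name `check_fastq` is hypothetical.

```python
import io


def check_fastq(handle):
    """Return a list of (record_number, problem) for a four-line FASTQ stream."""
    problems = []
    lines = [line.rstrip("\n") for line in handle]
    leftover = len(lines) % 4
    if leftover:
        # The last record is missing one or more of its four lines.
        problems.append((len(lines) // 4 + 1, "truncated record"))
    for i in range(0, len(lines) - leftover, 4):
        header, seq, plus, qual = lines[i:i + 4]
        record = i // 4 + 1
        if not header.startswith("@"):
            problems.append((record, "header does not start with '@'"))
        if not plus.startswith("+"):
            problems.append((record, "separator does not start with '+'"))
        if len(seq) != len(qual):
            problems.append((record, "sequence and quality lengths differ"))
    return problems


# Usage with an in-memory example instead of a real file.
example = "@read1\nACGT\n+\nIIII\n@read2\nACGT\n+\nIII\n"
for record, problem in check_fastq(io.StringIO(example)):
    print(record, problem)
```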
Expected input¶
- Raw reads
Expected output¶
- Cleaned reads
- Quality report
- Converted files
Assemble¶
Required step?¶
- No
Software currently supported¶
- ABySS (Simpson et al 2009)
- CABOG (Miller et al 2008)
- IDBA-UD (Peng et al 2012)
- MaSuRCA (Zimin et al 2013)
- MetaVelvet (Namiki et al 2011)
- MIRA (Chevreux et al 1999)
- RayMeta (Boisvert et al 2012)
- SGA (Simpson et al 2012)
- SOAPdenovo2 (Luo et al 2012)
- SPAdes (Bankevich et al 2012)
- SparseAssembler (Ye et al 2012)
- Velvet (Zerbino et al 2008)
- Velvet-SC (Chitsaz et al 2011)
What it does¶
- Construct assembly (no scaffolds)
Expected input¶
- Cleaned reads
Expected output¶
- Unitigs
- Contigs
- Singletons
- Degenerates/Surrogates
FindORFs¶
Required step?¶
- No
Software currently supported¶
- FragGeneScan (Rho, 2010)
- MetaGeneMark (Zhu, 2010)
- Prokka (Seemann, 2013)
What it does¶
- Finds/predicts ORFs in contigs
Expected input¶
- Assembled contigs in fasta format (>300bp)
Expected output¶
- ORFs in multi-fasta format (FAA,FNA)
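The expected input is restricted to contigs longer than 300bp. A filter of that kind could be sketched as below; this is a plain-Python illustration with hypothetical function names, not the code MetAMOS uses.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a multi-fasta stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def contigs_longer_than(handle, min_len=300):
    """Keep only contigs strictly longer than min_len bases."""
    return [(h, s) for h, s in read_fasta(handle) if len(s) > min_len]


# Usage with an in-memory example: one 301bp contig and one 300bp contig.
fasta = ">contig_a\n" + "A" * 301 + "\n>contig_b\n" + "C" * 300 + "\n"
kept = contigs_longer_than(io.StringIO(fasta))
print([header for header, _ in kept])
```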
Validate¶
Required step?¶
- No
Software currently supported¶
- ALE (Clark et al 2013)
- CGAL (Rahman et al 2013)
- FRCbam (Vezzi et al 2013)
- FreeBayes (Garrison et al 2012)
- LAP (Ghodsi et al 2013)
- QUAST (Gurevich et al 2013)
- REAPR (Hunt et al 2013)
What it does¶
- Checks assembly correctness using intrinsic quality metrics
Expected input¶
- Assembled contigs in fasta format
Expected output¶
- List of errors
- Poorly assembled regions
- Assembly quality metrics
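N50 is one widely reported assembly metric, and the Preprocess section above compares assemblers by corrected N50. The sketch below computes the plain N50 from a list of contig lengths. The corrected variant additionally breaks contigs at assembly errors before measuring, which this example does not attempt.

```python
def n50(lengths):
    """Return the length L such that contigs of length >= L
    together cover at least half of the total assembled bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```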
FindRepeats (deprecated)¶
This step was initially added to help speed up the Bambus 2 repeat identification step; optimizations to Bambus 2 have made this speed-up unnecessary. This step is turned off by default.
Required step?¶
- No
Software currently supported¶
- Repeatoire
What it does¶
- Find contigs (or parts of contigs) that appear to be repetitive and flag for further steps.
Expected input¶
- Assembled contigs in fasta format
Expected output¶
- List of contigs likely to be repeats
Abundance (deprecated)¶
This step was created to estimate the taxonomic abundance of a given metagenomic sample.
Required step?¶
- No
Software currently supported¶
- MetaPhyler (Liu et al 2011)
What it does¶
- Estimates the taxonomic abundance of a metagenomic sample.
Expected input¶
- Assembled contigs in fasta format
Expected output¶
- Taxonomic abundance estimates for the sample
Classify¶
Required step?¶
- Yes
Software currently supported¶
- FCP
- Kraken
- Phylosift
What it does¶
- Labels contigs with taxonomic id
Expected input¶
- Multi-fasta file of contigs
Expected output¶
- Text file containing contig id to taxonomic id 1-to-1 mapping
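The column layout of this mapping file is not specified here. Assuming, for illustration only, two tab-separated columns (contig id, then taxonomic id), a reader that also enforces one entry per contig might look like this. The function name is hypothetical.

```python
import io


def read_contig_taxids(handle):
    """Read 'contig<TAB>taxid' lines into a dict, rejecting duplicate contigs."""
    mapping = {}
    for line_no, line in enumerate(handle, start=1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError("line %d: expected 2 columns, got %d" % (line_no, len(fields)))
        contig, taxid = fields
        if contig in mapping:
            raise ValueError("line %d: contig %s listed twice" % (line_no, contig))
        mapping[contig] = taxid
    return mapping


# Usage with an in-memory example; ids are placeholders.
labels = read_contig_taxids(io.StringIO("contig_1\t562\ncontig_2\t1280\n"))
print(labels)
```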
FunctionalAnnotation¶
Required step?¶
- No
Software currently supported¶
- BLAST
What it does¶
- Assigns functional annotation to ORFs
Expected input¶
- ORFs in multi-fasta format (FAA,FNA)
Expected output¶
- Text file containing functional labels for ORFs
Scaffold¶
Required step?¶
- Yes
Software currently supported¶
- Bambus2 (Koren, 2011)
What it does¶
- Link together contigs using mate-pairs. Also identify variant patterns.
Expected input¶
- Assembled contigs in fasta format
Expected output¶
- Scaffolds in agp format
- Scaffolds in fasta format
- Motifs/variants
- Longer contigs in fasta format
Propagate¶
Required step?¶
- Yes
Software currently supported¶
- NA
What it does¶
- Propagate taxonomic labels along scaffolds
Expected input¶
- Scaffolds in agp format
- Contig taxonomic labels
Expected output¶
- Contig taxonomic labels
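How labels are propagated is not described here. One plausible approach, shown purely as an assumption, is to give each unlabeled contig the most common label among the labeled contigs in the same scaffold. The sketch reads scaffold membership from AGP lines (tab-separated; column 1 is the scaffold, column 5 the component type, column 6 the component id) and skips gap lines of type `N` or `U`. Function names are hypothetical.

```python
import io
from collections import Counter, defaultdict


def scaffold_members(agp_handle):
    """Map scaffold id -> list of contig ids, ignoring comments and gap lines."""
    members = defaultdict(list)
    for line in agp_handle:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        scaffold, comp_type, comp_id = cols[0], cols[4], cols[5]
        if comp_type in ("N", "U"):  # gap lines carry no contig id
            continue
        members[scaffold].append(comp_id)
    return members


def propagate_labels(members, contig_labels):
    """Fill in missing contig labels with the scaffold's most common label."""
    propagated = dict(contig_labels)
    for contigs in members.values():
        votes = Counter(contig_labels[c] for c in contigs if c in contig_labels)
        if not votes:
            continue  # no labeled contig in this scaffold
        winner = votes.most_common(1)[0][0]
        for contig in contigs:
            propagated.setdefault(contig, winner)  # keep existing labels
    return propagated


# Usage with an in-memory AGP example; ids are placeholders.
agp = (
    "scf1\t1\t100\t1\tW\tc1\t1\t100\t+\n"
    "scf1\t101\t150\t2\tN\t50\tscaffold\tyes\tpaired-ends\n"
    "scf1\t151\t250\t3\tW\tc2\t1\t100\t+\n"
    "scf1\t251\t350\t4\tW\tc3\t1\t100\t-\n"
)
members = scaffold_members(io.StringIO(agp))
print(propagate_labels(members, {"c1": "562", "c2": "562"}))
```

Existing labels are left untouched in this sketch; overwriting minority labels with the scaffold consensus would be a different design choice.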
FindScaffoldORFs¶
Required step?¶
- No
Software currently supported¶
- FragGeneScan
- MetaGeneMark
What it does¶
- Finds ORFs in scaffolds; mainly serves as an extra validation step after Scaffold.
Expected input¶
- Scaffolds in agp format
Expected output¶
- Multi-fasta file of ORFs as fna,faa