MetAMOS directory structure¶

Directory layout/description¶

All of the step mentioned below have the following directory structure:

[STEP]/in  -> required input
[STEP]/out -> generated output

We will now describe in detail the functionality of each step, along with the expected input & output.

Preprocess¶

Required step?¶

Yes

Software currently supported¶

ea-utils (code.google.com/p/ea-utils) - optional, off by default. Enabled by passing and trim option to runPipeline.
$ -t eautils
- Assemblers that do not perform trimming can benefit from enabling this step. On a GAGE-B dataset, the assemblers which had a higher corrected N50 on trimmed data than untrimmed were:
  IDBA-UD, SGA, SparseAssembler, SPAdes, Velvet-SC, and Velvet.
- Assembler which had a higher corrected N50 on untrimmed data were:
  ABySS, MaSuRCA, MIRA, Ray, and SOAPdenovo2.
FastQC (bioinformatics.babraham.ac.uk) - optional, on by default for iMetAMOS, used to generate quality reports for the input sequencing data.
KmerGenie (Chikhi et al 2014) - optional, on by default for iMetAMOS, used to auto-select a k-mer for isolate genome assembly. Alternatively, a list of k-mers can be specified instead. For assemblers using a range of k-mers (i.e. IDBA-UD), KmerGenie is not used but the read length is specified as the maximum k-mer. For assemblers using a set of k-mers (i.e. SPAdes), the KmerGenie selected k-mer along with a set of defaults is used.

What it does¶

Quality control
Read filtering
Read trimming
Sanity checks on fasta/q files
Conversion to required formats

Expected input¶

Raw reads

Expected output¶

Cleaned reads
Quality report
Converted files

Assemble¶

Required step?¶

No

Software currently supported¶

ABySS (Simpson et al 2009)
CABOG (Miller et al 2008)
IDBA-UD (Peng et al 2012)
MaSuRCA (Zimin et al 2013)
MetaVelvet (Namiki et al 2011)
Mira (Chevreux et al 1999)
RayMeta (Boisvert et al 2012)
SGA (Simpson et al 2012)
SOAPdenovo2 (Luo et al 2012)
SPAdes (Bankevich et al 2012)
SparseAssembler (Ye et al 2012)
Velvet (Zerbino et al 2008)
Velvet-SC (Chitsaz et al 2011)

What it does¶

Construct assembly (no scaffolds)

Expected input¶

Cleaned reads

Expected output¶

Unitigs
Contigs
Singletons
Degenerates/Surrogates

FindORFs¶

Required step?¶

No

Software currently supported¶

FragGeneScan (Rho, 2010)
MetaGeneMark (Zhu, 2010)
Prokka (Seemann, 2013)

What it does¶

Finds/predicts ORFs in contigs

Expected input¶

Assembled contigs in fasta format (>300bp)

Expected output¶

ORFs in multi-fasta format (FAA,FNA)

Validate¶

Required step?¶

No

Software currently supported¶

ALE (Clark et al 2013)
CGAL (Rahman et al 2013)
FRCbam (Vezzi et al 2013)
FreeBayes (Garrison et al 2012)
LAP (Ghodsi et al 2013)
QUAST (Gurevich et al 2013)
REAPR (Hunt et al 2013)

What it does¶

Checks assembly correctness using intrinsic quality metrics

Expected input¶

Assembled contigs in fasta format

Expected output¶

List of errors
Poorly assembled regions
Assembly quality metrics

FindRepeats (deprecated)¶

This step was initially added to help speed up Bambus 2 repeat identification step; optimizations to Bambus 2 have made this speed-up unnecessary. Step is turned off by default.

Required step?¶

No

Software currently supported¶

Repeatoire

What it does¶

Find contigs (or parts of contigs) that appear to be repetitive and flag for further steps.

Expected input¶

Assembled contigs in fasta format

Expected output¶

List of contigs likely to be repeats

Abundance (deprecated)¶

This step was created to estimate taxonomic abundance of a give metagenomic sample

Required step?¶

No

Software currently supported¶

MetaPhyler (Liu et al 2011)

What it does¶

Find contigs (or parts of contigs) that appear to be repetitive and flag for further steps.

Expected input¶

Assembled contigs in fasta format

Expected output¶

List of contigs likely to be repeats

Classify¶

Required step?¶

Yes

Software currently supported¶

FCP
Kraken
Phylosift

What it does¶

Labels contigs with taxonomic id

Expected input¶

Multi-fasta file of contigs

Expected output¶

Text file containing contig id to taxonomic id 1-to-1 mapping

FunctionalAnnotation¶

Required step?¶

No

Software currently supported¶

BLAST

What it does¶

Assigns functional annotation to ORFs

Expected input¶

ORFs in multi-fasta format (FAA,FNA)

Expected output¶

Text file containing functional labels for ORFs

Scaffold¶

Required step?¶

Yes

Software currently supported¶

Bambus2 (Koren, 2011)

What it does¶

Link together contigs using mate-pairs. Also identify variant patterns.

Expected input¶

Assembled contigs in fasta format

Expected output¶

scaffolds in agp format
scaffolds in fasta format
motifs/variants
longer contigs in fasta format

Propagate¶

Required step?¶

Yes

Software currently supported¶

NA

What it does¶

Propagate taxonomic labels along scaffolds

Expected input¶

Scaffolds in agp format
Contig taxonomic labels

Expected output¶

contig taxonomic labels

FindScaffoldORFs¶

Required step?¶

No

Software currently supported¶

FragGeneScan
MetaGeneMark

What it does¶

Find ORFs in scaffolds, mainly serves as an extra validation step after Scaffold.

Expected input¶

Scaffolds in agp format

Expected output¶

Multi-fasta file of ORFs as fna,faa

Binning¶

Required step?¶

Yes

Software currently supported¶

NA

What it does¶

Bins contigs/scaffold by taxonomic label

Expected input¶

Multi-fasta file of contigs
Multi-fasta file of scaffolds

Expected output¶

Binned out contigs/scaffolds by directory

Postprocess¶

Required step?¶

Yes

Software currently supported¶

Krona (Ondov, 2010)

What it does¶

Generates summary reports
Collates output
Generates combined HTML page

Expected input¶

Majority of the aforementioned outputs

Expected output¶

HTML summary file
Output directory tree