ECOGEO 'Omics Training: 4.1 Assembly

Frank Aylward; Daniel Mende

Aug 18, 2016

Version 2

ECOGEO 'Omics Training: 4.1 Assembly V.2

DOI

dx.doi.org/10.17504/protocols.io.fi6bkhe

Frank Aylward¹,
Daniel Mende¹

¹EarthCube Oceanography and Geobiology Environmental 'Omics

ECOGEO

Elisha M Wood-Charlson

KBase

Protocol Citation: Frank Aylward, Daniel Mende 2016. ECOGEO 'Omics Training: 4.1 Assembly. protocols.io https://dx.doi.org/10.17504/protocols.io.fi6bkhe

License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited

Protocol status: Working

Created: August 10, 2016

Last Modified: February 24, 2018

Protocol Integer ID: 3390

Abstract

This protocol provides a short introduction to the MEGAHIT, IDBA-UD, and SPAdes assemblers, a demo of the Prodigal gene caller, and a walkthrough of determining the percentage of reads aligned and the contig coverage using the Bowtie2 short-read aligner.

Open this protocol inside the virtual machine (details in 'Start Instructions') so that commands can be easily copied and pasted into the command-line terminal window.

Attachments

ECOGEO_4_1Assembly.p...

1.7MB

Before start

Before starting, please visit the ECOGEO website for more information on this "Introduction to Environmental 'Omics" training series. The site contains a pre-packaged virtual machine that can be downloaded and used to run all of the protocols in this protocols.io collection. In addition to the VM, the website contains video and presentations from our initial "Intro to Env 'Omics" workshop held at the Univ. of Hawai'i at Manoa on 25-26 Jul 2016.

Please email ‘ecogeo-join@earthcube.org’ to join the ECOGEO listserv for future updates.

Introduction to assemblers

Move to the directory containing the assemblers.
Command
$ cd /home/c-debi/ecogeo/assembly

View assembler parameters for MEGAHIT v1.0.3, IDBA-UD v1.1.1, and SPAdes v3.7.1
Command
These commands will show parameters for each assembler.
$ megahit
$ idba_ud
$ spades.py

Trimmomatic Quality Control:
Command
This step has already been completed for you.
PLEASE NOTE: Commands shown in black in the presentation and video should NOT be executed in the VM (the assembly steps require more computational power).
$ java -jar trimmomatic-0.35.jar PE SRR606249_R1.fastq SRR606249_R2.fastq R1_pe R1_se R2_pe R2_se ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 SLIDINGWINDOW:10:28 MINLEN:50
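
To illustrate what the SLIDINGWINDOW:10:28 and MINLEN:50 settings ask for, here is a minimal Python sketch. It is a simplification written for this explanation, not Trimmomatic's actual implementation: it cuts the read at the start of the first 10-base window whose mean quality falls below 28, then drops the read if fewer than 50 bases remain. The function names are hypothetical.

```python
def sliding_window_trim(quals, window=10, min_avg=28):
    """Return the number of leading bases to keep.

    Scans from the 5' end and cuts at the start of the first window
    whose mean Phred quality is below min_avg (simplified behaviour).
    """
    for start in range(0, len(quals) - window + 1):
        win = quals[start:start + window]
        if sum(win) / window < min_avg:
            return start
    return len(quals)


def trim_read(seq, quals, window=10, min_avg=28, min_len=50):
    """Trim one read; return None if it ends up shorter than min_len."""
    keep = sliding_window_trim(quals, window, min_avg)
    if keep < min_len:
        return None
    return seq[:keep], quals[:keep]


# Made-up example: 60 high-quality bases followed by 20 poor ones.
quals = [35] * 60 + [10] * 20
seq = "A" * 80
result = trim_read(seq, quals)
print(len(result[0]) if result else "dropped")
```

In this toy read the cut lands a few bases before the quality drop, because the window average already dips below 28 once three poor bases enter the window.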

Assemble with MEGAHIT:
Command
This step has already been completed for you and the command does NOT need to be executed again.
$ megahit --preset meta-sensitive -1 SRR606249.trim_R1.fastq -2 SRR606249.trim_R2.fastq -o SRR606249.megahit_asm

IDBA-UD: merge FASTQ files to interleaved FASTA files
File_R1: >Seq1
File_R2: >Seq1
File_merged: >Seq1.1
             >Seq2.1
 
Command
This step has already been completed for you and the command does NOT need to be executed again.
$ fq2fa --merge --filter SRR606249.trim_R1.fastq SRR606249.trim_R2.fastq SRR606249.trim.merged.fasta
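
The merge step can be pictured with a short Python sketch that interleaves two paired FASTQ files into one FASTA stream. This is an illustration only, not the fq2fa source: the `/1` and `/2` name suffixes are an assumption, and dropping pairs that contain `N` reflects my understanding of what `--filter` is for.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence) from a 4-line-per-record FASTQ handle."""
    while True:
        header = handle.readline().strip()
        if not header:
            break
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line
        handle.readline()  # quality line (not needed for FASTA)
        yield header[1:].split()[0], seq


def interleave(r1_handle, r2_handle, out_handle, drop_n=True):
    """Write mate 1 then mate 2 of each pair as consecutive FASTA records."""
    pairs = zip(read_fastq(r1_handle), read_fastq(r2_handle))
    for (name1, seq1), (name2, seq2) in pairs:
        if drop_n and ("N" in seq1 or "N" in seq2):
            continue  # assumed filter behaviour: skip pairs containing N
        out_handle.write(f">{name1}/1\n{seq1}\n>{name2}/2\n{seq2}\n")


# Made-up two-pair example; the second pair contains an N and is skipped.
r1 = io.StringIO("@Seq1\nACGT\n+\nIIII\n@Seq2\nANGT\n+\nIIII\n")
r2 = io.StringIO("@Seq1\nTTGA\n+\nIIII\n@Seq2\nCCGA\n+\nIIII\n")
out = io.StringIO()
interleave(r1, r2, out)
print(out.getvalue(), end="")
```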

Perform assembly using IDBA-UD:
Command
This step has already been completed for you and the command does NOT need to be executed again.
$ idba_ud -r SRR606249.trim.merged.fasta -o SRR606249.idbaud_asm --num_threads 45

Perform assembly using metaSPAdes:
Command
This step has already been completed for you and the command does NOT need to be executed again.
$ spades.py -o ./SRR606249.spades_asm --meta -1 SRR606249.trim_R1.fastq -2 SRR606249.trim_R2.fastq --threads 60 --memory 600

Reference assessment: QUAST can compare an assembly against the reference genomes used to construct the artificial metagenome. Start by restricting the assembly to a baseline contig size (>1 kb).
Command
$ seqmagick convert --min-length 1000 final.contigs.fa megahit_SRR606249.min1000.fasta
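
For readers who want to see what the length filter does, the following Python sketch keeps only FASTA records of at least 1000 bases. It is an illustrative stand-in for the seqmagick call, not a replacement for it, and the function names are hypothetical.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def filter_min_length(lines, min_length=1000):
    """Return the records whose sequence is at least min_length long."""
    return [(h, s) for h, s in read_fasta(lines) if len(s) >= min_length]


# Made-up contigs: one of 1200 bases (split over two lines), one of 300.
example = [">contig_a\n", "A" * 600 + "\n", "C" * 600 + "\n",
           ">contig_b\n", "G" * 300 + "\n"]
for header, seq in filter_min_length(example):
    print(header, len(seq))
```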

QUAST against 62 reference genomes:
Command
This step has already been completed for you and the command does NOT need to be executed again.
$ metaquast.py megahit_SRR606249.min1000.fasta -R ../Shakya_RefGenomes/

Prodigal Gene Caller

First step using prodigal:
 
File: spades_SRR606249.subset.fasta
(Contains a random subset of contigs from the metaSPAdes output.)
Command
-a = output, protein translations
-d = output, nucleotide putative coding sequences
-i = input
-m = treats missing sequence (NNNs) as stop
-o = output, genbank format
-p = procedure (meta selects metagenomic mode)
-q = quiet output
$ prodigal -a temp1.orfs.faa -d temp1.orfs.fna -i spades_SRR606249.subset.fasta -m -o temp1.txt -p meta -q

Check temp1.orfs.faa output:
Command
Number of putative proteins.
$ less temp1.orfs.faa
$ grep '>' temp1.orfs.faa | wc -l

Visualize the first 10 header lines:
Command
$ grep '>'

Use Unix to simplify the header output:
Command
$ cut -f1 -d 
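
Prodigal writes long header lines in which the gene identifier is followed by coordinates and attributes. A minimal Python sketch of the simplification is shown below, assuming the goal is to keep only the first whitespace-delimited token of each header. The contig name in the example is made up, and the function name is hypothetical.

```python
def simplify_headers(lines):
    """Keep only the first whitespace-delimited token of each FASTA header.

    Sequence lines are passed through unchanged.
    """
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            yield line.split()[0]
        else:
            yield line


# Made-up record in the style of a Prodigal protein header.
example = [
    ">NODE_1_1 # 3 # 98 # 1 # ID=1_1;partial=10;start_type=Edge\n",
    "MKT*\n",
]
print("\n".join(simplify_headers(example)))
```

Shorter headers are easier to match against contig names in later steps, which is the usual reason for trimming them.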

Repeat for nucleotides: 
Command
$ cut -f1 -d 

Determine putative genes for contigs from SPAdes: 

 
Repeat for MEGAHIT and IDBA-UD
Command
$ prodigal -a temp1.orfs.faa -i spades_SRR606249.min1000.fasta -m -o temp1.txt -p meta -q
$ grep '>' temp1.orfs.faa | wc -l
$ cut -f1 -d 

Determining Coverage

Determine the percentage of reads aligned to the assembly and the contig coverage using the Bowtie2 short-read aligner.
 
Build index file of assembled contigs:
Command
This step has already been completed for you and the command does NOT need to be executed again.
$ bowtie2-build spades_SRR606249.min1000.fasta spades_SRR606249.min1000.bt_index

Align the trimmed, high-quality reads to the contigs and write the output to a SAM file:
Command
This step has already been completed for you and the command does NOT need to be executed again.
$ bowtie2 -q -1 SRR606249.trim_R1.fastq -2 SRR606249.trim_R2.fastq -x spades_SRR606249.min1000.bt_index --no-unal -S spades_SRR606249.sam -p 35
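
One way to estimate the percentage of reads aligned from a SAM file is sketched below in Python. It assumes you already know the total number of input reads (both mates counted), since unaligned reads are largely absent from a SAM written with --no-unal. The function names and the toy SAM records are hypothetical, and Bowtie2's own alignment summary remains the authoritative figure.

```python
def count_aligned_reads(sam_lines):
    """Count primary alignment records of mapped reads in SAM text."""
    aligned = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag = int(fields[1])
        if flag & 0x4:
            continue  # read itself is unmapped
        if flag & 0x900:
            continue  # secondary (0x100) or supplementary (0x800) record
        aligned += 1
    return aligned


def percent_aligned(sam_lines, total_reads):
    """Percentage of input reads with a primary alignment."""
    return 100.0 * count_aligned_reads(sam_lines) / total_reads


# Made-up SAM: one properly paired read pair, plus a pair in which
# only the first mate aligned.
sam = [
    "@HD\tVN:1.0\n",
    "r1\t99\tcontig1\t1\t42\t4M\t=\t10\t13\tACGT\tIIII\n",
    "r1\t147\tcontig1\t10\t42\t4M\t=\t1\t-13\tACGT\tIIII\n",
    "r2\t73\tcontig1\t5\t42\t4M\t=\t5\t0\tACGT\tIIII\n",
    "r2\t133\tcontig1\t5\t0\t*\t=\t5\t0\tACGT\tIIII\n",
]
print(percent_aligned(sam, total_reads=8))
```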

Use featureCounts to count the reads aligned to each contig. featureCounts requires an annotation file, supplied here as a pseudo-annotation in SAF format generated from the FASTA input.
Command
The 
$ python fastaToSaf.py < spades_SRR606249.min1000.fasta > spades_SRR606249.min1000.saf
$ featureCounts -F SAF -a spades_SRR606249.min1000.saf -o spades_SRR606249.min1000.readcount spades_SRR606249.sam
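
The FASTA-to-SAF conversion can be sketched as follows. This is not the fastaToSaf.py script shipped with the VM, whose contents are not shown here. It is a minimal Python illustration that assumes each contig becomes a single feature covering its full length on the `+` strand, using the standard five SAF columns.

```python
def fasta_to_saf(lines):
    """Yield SAF rows: one feature per contig, spanning its full length."""
    yield "GeneID\tChr\tStart\tEnd\tStrand"
    name, length = None, 0
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield f"{name}\t{name}\t1\t{length}\t+"
            # Use the first token of the header as the contig name.
            name, length = line[1:].split()[0], 0
        elif name is not None:
            length += len(line)
    if name is not None:
        yield f"{name}\t{name}\t1\t{length}\t+"


# Made-up two-contig FASTA.
example = [">c1 len=8\n", "ACGT\n", "ACGT\n", ">c2\n", "AC\n"]
print("\n".join(fasta_to_saf(example)))
```

Treating each contig as one feature lets featureCounts report a per-contig read count together with the contig length.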

The custom Python script convertReadcountToCoverage.py can accept multiple readcount inputs and generate a combined coverage matrix:
Command
$ grep 
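
The idea behind converting read counts to coverage can be sketched in Python as below. This is not the convertReadcountToCoverage.py script itself. It assumes a featureCounts-style table (Geneid, Chr, Start, End, Strand, Length, count) and approximates coverage as reads × read length ÷ contig length. The `read_length` value and the sample names are hypothetical.

```python
def read_counts(lines):
    """Parse a featureCounts-style table into {contig: (length, count)}."""
    table = {}
    for line in lines:
        if line.startswith("#") or line.startswith("Geneid"):
            continue  # comment line or column header
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            continue
        table[fields[0]] = (int(fields[5]), int(fields[6]))
    return table


def coverage_matrix(samples, read_length):
    """Combine several readcount inputs into {contig: {sample: coverage}}."""
    matrix = {}
    for sample, lines in samples.items():
        for contig, (length, count) in read_counts(lines).items():
            matrix.setdefault(contig, {})[sample] = count * read_length / length
    return matrix


# Made-up readcount tables for two samples.
header = "Geneid\tChr\tStart\tEnd\tStrand\tLength\tsample.sam\n"
sample_a = ["# Program:featureCounts\n", header, "c1\tc1\t1\t1000\t+\t1000\t50\n"]
sample_b = ["# Program:featureCounts\n", header, "c1\tc1\t1\t1000\t+\t1000\t20\n"]
matrix = coverage_matrix({"A": sample_a, "B": sample_b}, read_length=100)
print(matrix)
```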
