Aug 18, 2016

ECOGEO 'Omics Training: 4.1 Assembly V.2

  • Frank Aylward¹
  • Daniel Mende¹
  • ¹ EarthCube Oceanography and Geobiology Environmental 'Omics
  • ECOGEO
Protocol Citation: Frank Aylward, Daniel Mende 2016. ECOGEO 'Omics Training: 4.1 Assembly. protocols.io https://dx.doi.org/10.17504/protocols.io.fi6bkhe
License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Protocol status: Working
Created: August 10, 2016
Last Modified: February 24, 2018
Protocol Integer ID: 3390
Abstract
Provides a short introduction to the MEGAHIT, IDBA-UD, and SPAdes assemblers, a demo of the Prodigal gene caller, and a walkthrough of determining the percentage of reads aligned and the contig coverage using the Bowtie2 short-read aligner.
Open this protocol inside the virtual machine (details in 'Start Instructions') to copy and paste commands easily into the command-line terminal window.
Before start
Before starting, please visit the ECOGEO website for more information on this "Introduction to Environmental 'Omics" training series. The site contains a pre-packaged virtual machine that can be downloaded and used to run all of the protocols in this protocols.io collection. In addition to the VM, the website contains video and presentations from our initial "Intro to Env 'Omics" workshop held at the Univ. of Hawai'i at Manoa on 25-26 Jul 2016. Please email ‘ecogeo-join@earthcube.org’ to join the ECOGEO listserv for future updates.
Introduction to assemblers
Move to directory containing assemblers. 
Command
$ cd /home/c-debi/ecogeo/assembly
View assembler parameters for MEGAHIT v1.0.3, IDBA-UD v1.1.1, and SPAdes v3.7.1
Command
These commands will show parameters for each assembler.
$ megahit
$ idba_ud
$ spades.py
Trimmomatic Quality Control:
Command
This step has already been completed for you. PLEASE NOTE: Commands shown in black in the presentation and video should NOT be executed in the VM, because the assembly steps require more computational power than the VM provides.
$ java -jar trimmomatic-0.35.jar PE SRR606249_R1.fastq SRR606249_R2.fastq R1_pe R1_se R2_pe R2_se ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 SLIDINGWINDOW:10:28 MINLEN:50
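To make the SLIDINGWINDOW:10:28 and MINLEN:50 settings above more concrete, here is a minimal Python sketch of sliding-window quality trimming. It is a simplified illustration, not Trimmomatic's actual implementation: the function names are hypothetical, Phred+33 quality encoding is assumed, and the read is simply cut at the start of the first window whose mean quality drops below the threshold.

```python
def phred33(quality_string):
    """Decode a Phred+33 quality string into integer scores (assumed encoding)."""
    return [ord(ch) - 33 for ch in quality_string]


def sliding_window_trim(seq, qual, window=10, min_avg=28, min_len=50):
    """Simplified sliding-window trim (hypothetical helper, not Trimmomatic code).

    Scan windows from the 5' end. At the first window whose mean quality is
    below min_avg, cut the read at the start of that window. Return None when
    the surviving read is shorter than min_len, otherwise (seq, qual).
    """
    scores = phred33(qual)
    keep = len(seq)
    for start in range(0, max(len(seq) - window + 1, 0)):
        win = scores[start:start + window]
        if sum(win) / len(win) < min_avg:
            keep = start
            break
    if keep < min_len:
        return None
    return seq[:keep], qual[:keep]


# Example: 60 high-quality bases (Q40, 'I') followed by 20 low-quality bases (Q2, '#').
trimmed = sliding_window_trim("A" * 60 + "C" * 20, "I" * 60 + "#" * 20)
```

Reads that end up shorter than the minimum length are dropped entirely, which is why a paired-end run produces both paired and unpaired ("se") output files.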
Assemble with MEGAHIT:
Command
This step has already been completed for you and the command does NOT need to be executed again.
$ megahit --preset meta-sensitive -1 SRR606249.trim_R1.fastq -2 SRR606249.trim_R2.fastq -o SRR606249.megahit_asm
IDBA-UD: merge FASTQ files to interleaved FASTA files
File_R1: >Seq1            File_R2: >Seq1
File_merged: >Seq1.1
                      >Seq2.1
Command
This step has already been completed for you and the command does NOT need to be executed again.
$ fq2fa --merge --filter SRR606249.trim_R1.fastq SRR606249.trim_R2.fastq SRR606249.trim.merged.fasta
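For readers who want to see what interleaving a pair of FASTQ files into a single FASTA involves, the following Python sketch shows the idea. It is not the fq2fa source: the function names are hypothetical, the /1 and /2 read-name suffixes are an assumed naming convention, and dropping pairs that contain an N is only an approximation of what the --filter option is understood to do.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence) from a FASTQ stream with four lines per record."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            return
        seq = handle.readline().rstrip()
        handle.readline()  # '+' separator line
        handle.readline()  # quality line (not needed for FASTA output)
        yield header[1:].split()[0], seq


def interleave_to_fasta(r1_handle, r2_handle, out_handle, drop_n=True):
    """Write mates alternately (R1, R2, R1, R2, ...) as FASTA records.

    Pairs in which either mate contains an N are skipped when drop_n is True.
    Returns the number of pairs written.
    """
    written = 0
    for (name1, seq1), (name2, seq2) in zip(read_fastq(r1_handle), read_fastq(r2_handle)):
        if drop_n and ("N" in seq1 or "N" in seq2):
            continue
        out_handle.write(f">{name1}/1\n{seq1}\n>{name2}/2\n{seq2}\n")
        written += 1
    return written


# Tiny in-memory example with two read pairs; the second pair contains an N.
r1 = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nANGT\n+\nIIII\n")
r2 = io.StringIO("@read1\nTTTT\n+\nIIII\n@read2\nGGGG\n+\nIIII\n")
merged = io.StringIO()
pairs_written = interleave_to_fasta(r1, r2, merged)
```

The sketch relies on both input files listing mates in the same order, which is why paired files must be trimmed together rather than independently.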
Perform assembly using IDBA-UD:
Command
This step has already been completed for you and the command does NOT need to be executed again.
$ idba_ud -r SRR606249.trim.merged.fasta -o SRR606249.idbaud_asm --num_threads 45
Perform assembly using MetaSPAdes:
Command
This step has already been completed for you and the command does NOT need to be executed again.
$ spades.py -o ./SRR606249.spades_asm --meta -1 SRR606249.trim_R1.fastq -2 SRR606249.trim_R2.fastq --threads 60 --memory 600
Reference assessment: QUAST can perform comparisons against the reference genomes used to construct the artificial metagenome. Start by filtering to a baseline contig size (>1 kb).
Command
$ seqmagick convert --min-length 1000 final.contigs.fa megahit_SRR606249.min1000.fasta
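The length filter above can also be expressed in a few lines of Python, which may help when seqmagick is not available. This is a minimal sketch with hypothetical function names; it keeps sequences of at least min_length bases, which is how the --min-length 1000 setting is assumed to behave.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA stream, joining wrapped lines."""
    name, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line.strip())
    if name is not None:
        yield name, "".join(chunks)


def filter_min_length(records, min_length=1000):
    """Keep only records whose sequence has at least min_length bases."""
    return [(name, seq) for name, seq in records if len(seq) >= min_length]


# In-memory example: one short contig and one 1,000 bp contig.
example = io.StringIO(">short\nACGT\n>long\n" + "G" * 1000 + "\n")
kept = filter_min_length(read_fasta(example))
```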
QUAST against 62 reference genomes:
Command
This step has already been completed for you and the command does NOT need to be executed again.
$ metaquast.py megahit_SRR606249.min1000.fasta -R ../Shakya_RefGenomes/
Prodigal Gene Caller
First step using prodigal:
File: spades_SRR606249.subset.fasta
(Contains a random subset of contigs from metaSPAdes output. )
Command
-a = output, protein translations
-d = output, nucleotide sequences of putative coding regions
-i = input
-m = treats missing sequence (runs of Ns) as a stop
-o = output, GenBank format
-p meta = metagenomic procedure
-q = quiet output
$ prodigal -a temp1.orfs.faa -d temp1.orfs.fna -i spades_SRR606249.subset.fasta -m -o temp1.txt -p meta -q
Check temp1.orfs.faa output:
Command
Number of putative proteins.
$ less temp1.orfs.faa
$ grep '>' temp1.orfs.faa | wc -l
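The same count can be reproduced in Python, since every predicted protein contributes exactly one header line. The sketch below uses a hypothetical function name and counts lines that start with '>', which gives the same result as the grep command as long as '>' never appears inside a sequence line.

```python
import io


def count_fasta_records(handle):
    """Count FASTA records by counting lines that begin with '>'."""
    return sum(1 for line in handle if line.startswith(">"))


# In-memory example with two protein records; for a real file, pass
# an open file handle such as open("temp1.orfs.faa") instead.
n_proteins = count_fasta_records(io.StringIO(">p1\nMKV*\n>p2\nMAA\nLLL*\n"))
```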
Visualize the first 10 header lines:
Command
$ grep '>
Use Unix to simplify the header output:
Command
$ cut -f1 -d 
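As a rough Python equivalent of reducing each header to its first field, the sketch below keeps only the text before the first delimiter on every line. The delimiter in the command above is not shown in full, so a single space is an assumption here; the function names and the example header are also illustrative rather than taken from the protocol's files.

```python
import io
from itertools import islice


def first_headers(handle, n=10):
    """Return the first n header lines (lines starting with '>') of a FASTA stream."""
    headers = (line.rstrip("\n") for line in handle if line.startswith(">"))
    return list(islice(headers, n))


def keep_first_field(handle, out_handle, delimiter=" "):
    """Write each line truncated at the first delimiter (assumed: one space).

    Sequence lines contain no spaces, so only header lines are shortened.
    """
    for line in handle:
        out_handle.write(line.rstrip("\n").split(delimiter, 1)[0] + "\n")


# Illustrative input: a header with extra annotation after the contig-based ID.
source = io.StringIO(">contig1_1 # 3 # 500 # 1 # ID=1_1;partial=10\nMKV*\n")
simplified = io.StringIO()
keep_first_field(source, simplified)
```

Shorter headers make downstream tables easier to join, because the remaining identifier is just the contig name plus a gene number.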
Repeat for nucleotides: 
Command
$ cut -f1 -d 
Determine putative genes for contigs from SPAdes:
Repeat for MEGAHIT and IDBA-UD.
Command
$ prodigal -a temp1.orfs.faa -i spades_SRR606249.min1000.fasta -m -o temp1.txt -p meta -q
$ grep ">" temp1.orfs.faa | wc -l
$ cut -f1 -d 
Determining Coverage
Determine the percentage of reads aligned and the contig coverage using the Bowtie2 short-read aligner.
Build index file of assembled contigs:
Command
This step has already been completed for you and the command does NOT need to be executed again.
$ bowtie2-build spades_SRR606249.min1000.fasta spades_SRR606249.min1000.bt_index
Align the trimmed, high-quality reads to the indexed contigs and write the output as a SAM file:
Command
This step has already been completed for you and the command does NOT need to be executed again.
$ bowtie2 -q -1 SRR606249.trim_R1.fastq -2 SRR606249.trim_R2.fastq -x spades_SRR606249.min1000.bt_index --no-unal -S spades_SRR606249.sam -p 35
Use featureCounts to count the reads aligned to each contig. featureCounts requires an annotation file, so a pseudo-annotation in SAF format is generated from the FASTA input.
Command
$ python fastaToSaf.py < spades_SRR606249.min1000.fasta > spades_SRR606249.min1000.saf
$ featureCounts -F SAF -a spades_SRR606249.min1000.saf -o spades_SRR606249.min1000.readcount spades_SRR606249.sam
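The fastaToSaf.py script itself is not shown in this protocol, so the following is only a sketch of what such a conversion could look like, with hypothetical function names. It assumes each contig becomes a single feature spanning its full length on the + strand, written in the tab-delimited SAF layout (GeneID, Chr, Start, End, Strand), and it uses the first whitespace-delimited token of each FASTA header as the contig name so that names match the reference names in the SAM file.

```python
import io


def contig_lengths(handle):
    """Yield (contig_name, length) for each record in a FASTA stream."""
    name, length = None, 0
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, length
            parts = line[1:].split()
            name, length = (parts[0] if parts else ""), 0
        elif name is not None:
            length += len(line)
    if name is not None:
        yield name, length


def fasta_to_saf(handle, out_handle):
    """Write one SAF row per contig covering positions 1..length (assumed layout)."""
    out_handle.write("GeneID\tChr\tStart\tEnd\tStrand\n")
    for name, length in contig_lengths(handle):
        out_handle.write(f"{name}\t{name}\t1\t{length}\t+\n")


# In-memory example with two small contigs.
saf = io.StringIO()
fasta_to_saf(io.StringIO(">NODE_1 cov=5\nACGT\nAC\n>NODE_2\nGG\n"), saf)
```

Treating every contig as one feature is what lets featureCounts, a tool designed for gene-level counting, report reads per contig.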
The custom Python script convertReadcountToCoverage.py can accept multiple readcount inputs to generate a combined coverage matrix:
Command
$ grep
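Because convertReadcountToCoverage.py is not reproduced here, the sketch below only illustrates one plausible way to turn read counts into a combined coverage matrix; it should not be read as the script's actual logic. It assumes the standard featureCounts table layout (a comment line starting with '#', a header row, then Geneid, Chr, Start, End, Strand, Length, and one count column), and it approximates coverage as count × read length / contig length. The function names and the default read length of 100 bp are hypothetical.

```python
import io


def read_counts(handle):
    """Parse a featureCounts table into {contig: (length, count)}."""
    table = {}
    for line in handle:
        if line.startswith("#") or line.startswith("Geneid"):
            continue  # skip the comment line and the column header
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            continue  # ignore blank or malformed lines
        table[fields[0]] = (int(fields[5]), int(fields[6]))
    return table


def coverage_matrix(samples, read_length=100):
    """Combine several readcount tables into one coverage matrix.

    samples maps a sample name to an open text handle. Returns the sample
    names (column order) and {contig: [coverage per sample]}. Coverage is
    approximated as count * read_length / contig_length (an assumption).
    """
    names = list(samples)
    matrix = {}
    for col, name in enumerate(names):
        for contig, (length, count) in read_counts(samples[name]).items():
            row = matrix.setdefault(contig, [0.0] * len(names))
            row[col] = count * read_length / length
    return names, matrix


# In-memory example with a single sample and one contig.
header = "# Program:featureCounts\nGeneid\tChr\tStart\tEnd\tStrand\tLength\tsample.sam\n"
demo = io.StringIO(header + "c1\tc1\t1\t1000\t+\t1000\t50\n")
demo_names, demo_matrix = coverage_matrix({"sample": demo})
```

Contigs missing from a sample's table keep a coverage of 0.0 in that column, so every row of the matrix has the same number of entries.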