Oct 31, 2018

NanoAmpli-Seq - Bioinformatics Workflow

  • ¹Ex-Uni of Glasgow/Birmingham/Aberystwyth;
  • ²University of Glasgow;
  • ³Northeastern University
  • Pinto Lab
Protocol Citation: Szymon T STC Calus, Umer Zeeshan Ijaz, Ameet Pinto 2018. NanoAmpli-Seq - Bioinformatics Workflow. protocols.io https://dx.doi.org/10.17504/protocols.io.u25eyg6
License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Protocol status: Working
We use this protocol in our group and it is working very well.
Created: October 26, 2018
Last Modified: October 31, 2018
Protocol Integer ID: 17213
Guidelines
Test data is available on the European Nucleotide Archive (ENA) website

Safety warnings
The highest accuracy is achieved when only 1D² reads are used with the INC-Seq, chopSEQ and nanoCLUST algorithms.

We also tested 1D data with the NanoAmpli-Seq (NA-S) bioinformatics workflow, but observed higher overall error rates and false-positive OTUs when validating on mock samples.

We do not recommend 1D data for applications that require high-accuracy profiling, e.g. clinical samples. However, 1D reads can be used for research purposes, e.g. development of correction algorithms.
Before start
Make sure all the necessary programs and dependencies are installed on your PC or server and work correctly.
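
One way to check the installation before starting is sketched below. The executable names are assumptions: the four scripts are taken from the commands used later in this protocol, and blastn, etandem, vsearch and mafft stand in for the BLAST, EMBOSS, VSEARCH and MAFFT dependencies. Adjust the list if your installation uses different names.

```shell
#!/usr/bin/env bash
# Report which of the given executables cannot be found on PATH.
# Returns 0 when all of them are present and 1 otherwise.
check_tools() {
    local missing=0
    local tool
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Names assumed from the commands in this protocol; edit as needed.
check_tools full_1dsq_basecaller.py inc-seq.py chopSEQ.py nanoCLUST.py \
            blastn etandem vsearch mafft \
    || echo "Install the missing tools or add them to PATH before continuing."
```

This only confirms that each program can be found on PATH. It does not confirm that the installed version works with this workflow.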

Download and install all the required software.

Software: Albacore (OS: Linux; developer: Oxford Nanopore Tech.)
Software: INC-Seq (OS: Linux; developer: Genome Institute of Singapore)
Software: chopSeq (OS: Linux; developer: University of Glasgow)
Software: nanoCLUST (OS: Linux; developer: University of Glasgow)

Basecalling of raw nanopore data with Albacore software.

Command
Raw data (HDF5) generated with MinKNOW must be basecalled with Albacore v2.3.3 or newer, with the output written in FASTA format. Because further analysis uses 1D² data only, the full_1dsq_basecaller.py script must be used.
# The program requires the input data directory (-i), flow cell version (-f),
# sequencing kit version (-k), output format (-o),
# number of worker threads (-t) and the save directory (-s).

/home/opt/.pyenv/versions/3.5.0/bin/full_1dsq_basecaller.py -i data/ -f FLO-MIN107 -k SQK-LSK308 -o fasta -t 20 -s .
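
If the basecaller leaves more than one FASTA file under the save directory, they can be combined into a single file for the next step. The sketch below is a minimal example under that assumption. The function names `merge_fasta` and `count_reads` are hypothetical helpers, and the exact output layout of your Albacore version should be checked first.

```shell
#!/usr/bin/env bash
# Concatenate every *.fasta file found under a directory into one file.
# Usage: merge_fasta <search_dir> <merged_output>
merge_fasta() {
    local search_dir=$1 merged=$2 tmp
    tmp=$(mktemp)
    # Skip any file with the same name as the merged output so that a
    # previous run is not merged into itself; sort the list so the result
    # is reproducible between runs.
    find "$search_dir" -type f -name '*.fasta' \
         ! -name "$(basename "$merged")" -print0 \
        | sort -z \
        | xargs -0 -r cat > "$tmp"
    mv "$tmp" "$merged"
}

# Count FASTA records by counting header lines (those starting with '>').
count_reads() {
    grep -c '^>' "$1" || true
}

# Example (not executed here):
#   merge_fasta . input.fasta
#   count_reads input.fasta
```

Counting the reads at this point gives a baseline for judging how many reads survive the later correction steps.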

Consensus calling of long concatemerized 16S rRNA reads using the INC-Seq algorithm.

Command
INC-Seq requires the basecalled data (e.g. from Albacore) produced in Step 2. Correction with the INC-Seq algorithm uses 1D² data only and has two main steps: 1) identification of segments consisting of 16S rRNA genes; 2) anchor alignment of the concatemerized amplicons and consensus calling with PBDAGCon. Corrected reads are ~98% accurate and can be used directly as input for chopSEQ.
# Export the PATHs for all required programs.
# These PATHs are specific to our cluster and may differ
# from yours, depending on where you installed the programs.

export PYENV_ROOT="/home/opt/.pyenv"
export PATH="$PYENV_ROOT/bin:$PATH"
eval "$(pyenv init -)"
export PYTHONPATH=/home/opt/INC-Seq/utils:$PYTHONPATH
export PATH=/home/opt/pacb/bin:$PATH
export PATH=/home/opt/pbdagcon/src/cpp:$PATH
export PATH=/home/opt/ncbi-blast-2.2.28+/bin:$PATH
export PATH=/home/opt/INC-Seq:$PATH
export PATH=/home/opt/.pyenv/versions/3.4.0/bin:$PATH


# INC-Seq consensus calling requires the input data (-i),
# an aligner (-a), e.g. poa, the output file name (-o),
# the minimum number of concatemer copies (--copy_num_thre) and the --iterative flag.

inc-seq.py -i input.fasta -a poa -o incseq.fasta --iterative --copy_num_thre 3
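
Before moving on, it can help to look at the length distribution of the consensus reads, for example to confirm that most of them are close to the expected amplicon size. The following is a minimal sketch that uses only awk. The function name `fasta_length_stats` is a hypothetical helper and not part of INC-Seq.

```shell
#!/usr/bin/env bash
# Print the number of reads and the minimum, mean and maximum sequence
# length of a FASTA file. Sequences may span several lines.
fasta_length_stats() {
    awk '
        # Close the current record and fold its length into the totals.
        function record_done() {
            if (len > 0) {
                n++
                total += len
                if (min == 0 || len < min) min = len
                if (len > max) max = len
            }
            len = 0
        }
        /^>/ { record_done(); next }      # header line starts a new record
        { len += length($0) }             # sequence line
        END {
            record_done()
            if (n == 0) { print "reads=0"; exit }
            printf "reads=%d min=%d mean=%.1f max=%d\n", n, min, total / n, max
        }
    ' "$1"
}

# Example (not executed here):
#   fasta_length_stats incseq.fasta
```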

Correction of wrongly oriented reads and size filtration with the chopSEQ algorithm.

Command
chopSEQ requires the INC-Seq corrected data from Step 3. Correction is divided into three steps: 1) identification of the forward and reverse primers (e.g. 8F and 1387R) with the pairwise2 aligner; 2) re-orientation of incorrectly concatemerized reads, removal of tandem repeats recognised with etandem (EMBOSS), and subsequent merging of reads; 3) size filtration with Biopython. The reads are then ready for nanoCLUST OTU binning and consensus calling.
# The algorithm requires the input data (-i) from the previous step,
# the forward (-f) and reverse (-r) primer sequences,
# the minimum (-l) and maximum (-m) read length for size filtration,
# and an output destination (> new_file.fasta);
# verbose mode (-v) is optional.

chopSEQ.py -i incseq.fasta -f "AGRGTTTGATCMTGGCTCAG" -r "GGGCGGWGTGTACAAGRC" -l 1250 -m 1500 -v > chopseq.fasta
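
As a quick sanity check on the size filtration, the number of reads outside the requested length range can be counted independently of chopSEQ. The sketch below assumes that both bounds are inclusive, which this protocol does not state. `count_out_of_range` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Count FASTA records whose sequence length falls outside [min_len, max_len].
# Usage: count_out_of_range <fasta> <min_len> <max_len>
count_out_of_range() {
    local fasta=$1 min_len=$2 max_len=$3
    awk -v lo="$min_len" -v hi="$max_len" '
        # Close the current record and test its length against the bounds.
        function record_done() {
            if (seen && (len < lo || len > hi)) bad++
            len = 0
        }
        /^>/ { record_done(); seen = 1; next }
        { len += length($0) }
        END { record_done(); print bad + 0 }
    ' "$fasta"
}

# Example (not executed here), using the bounds from the command above:
#   count_out_of_range chopseq.fasta 1250 1500
```

A result of 0 is expected if the filter behaved as assumed. A non-zero count may simply mean that chopSEQ handles the boundaries differently.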

Read binning and generation of OTUs with the nanoCLUST algorithm.

Command
nanoCLUST requires the chopSEQ corrected data from Step 4. Processing is divided into multiple steps: 1) Reads are partitioned (i.e. 1-450, 451-900 and 901-1300 bp). 2) Reads from each partition are grouped at 97% similarity with VSEARCH. 3) VSEARCH dereplication, singleton removal and binning are performed on each partition. 4) The optimal read sections are used for clustering. 5) MAFFT G-INS-i is used for within-OTU alignment and consensus calling. 6) Consensus sequences are generated (~99.5% accuracy). 7) An abundance table is generated.
# Export the PATHs for all required programs.
# These PATHs are specific to our cluster and may differ
# from yours, depending on where you installed the programs.

export PATH=/home/opt/vsearch/bin:$PATH
export PATH=/home/opt/mafft-7.273-without-extensions/core/bin:$PATH
export MAFFT_BINARIES=/home/opt/mafft-7.273-without-extensions/core/libexec/mafft


# Provide the chopSEQ corrected data (-i), the window split
# range (-s) for read partitioning and the output folder (-o).

nanoCLUST.py -i chopseq.fasta -s 0,450,451,900,901,1300,-1 -o nanoclust_output/
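
To illustrate what read partitioning means, the sketch below cuts one window (for example positions 1-450) out of every read with awk. It is only an illustration of the idea. It is not nanoCLUST's implementation, it does not interpret the -s string (including the trailing -1), and `extract_window` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Print the sub-sequence between 1-based positions start..end (inclusive)
# of every read in a FASTA file. Sequences may span several lines.
# Usage: extract_window <fasta> <start> <end>
extract_window() {
    local fasta=$1 start=$2 end=$3
    awk -v s="$start" -v e="$end" '
        # Emit the window of the record collected so far.
        function record_done() {
            if (header != "") {
                print header "_" s "-" e
                print substr(seq, s, e - s + 1)
            }
            seq = ""
        }
        /^>/ { record_done(); header = $1; next }   # keep the read ID only
        { seq = seq $0 }
        END { record_done() }
    ' "$fasta"
}

# Example (not executed here): the three partitions named in the text.
#   extract_window chopseq.fasta 1 450    > part1.fasta
#   extract_window chopseq.fasta 451 900  > part2.fasta
#   extract_window chopseq.fasta 901 1300 > part3.fasta
```

Reads shorter than the window end simply yield a shorter (or empty) sub-sequence here. How nanoCLUST treats such reads is not described in this protocol.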