Jul 30, 2020

Public workspaceSNP Calling and VCF Filtering Pipeline V.1

  • 1University of Hawaii at Hilo
Icon indicating open access to content
QR code linking to this content
Protocol CitationM Renee Bellinger 2020. SNP Calling and VCF Filtering Pipeline. protocols.io https://dx.doi.org/10.17504/protocols.io.84fhytn
Manuscript citation:
Bellinger, MR, R Paudel, S Starnes, L Kambic, M Kantar, T Wolfgruber, K Lamour, S Geib, S Sim, S Miyasaka, M Helmkampf, M Shintaku. Taro genome assembly and linkage map reveal QTLs for resistance to Taro Leaf Blight. Submitted GigaScience.
License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License,  which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited
Protocol status: Working
This protocol was implemented in "Taro Genome Assembly and Linkage Map Reveal QTLs for Resistance to Taro Leaf Blight". Citation: Bellinger, M. Renee, Roshan Paudel, Steven Starnes, Lukas Kambic, Michael Kantar, Thomas Wolfgruber, Kurt Lamour et al. "Taro Genome Assembly and Linkage Map Reveal QTLs for Resistance to Taro Leaf Blight." G3: Genes, Genomes, Genetics (2020). All data files associated with running the protocol are listed in 'Abstract' section of linked to this protocol's repository. For Genot
Created: November 06, 2019
Last Modified: July 30, 2020
Protocol Integer ID: 29543
Keywords: SNPs, GBS, mapping population, vcf, vcf filtering
Abstract
SNP-calling and genotype filtering using bowtie2, samtools, bcftools, and vcftools

This pipeline was used for calling SNPs using GBS data obtained from Taro Leaf Blight resistant mapping population of taro, Colocasia Esculenta mapped to a taro genome assembled from 10x Genomics linked-reads and Oxford Nanopore Technology long-reads.

The genotype filtering is meant to be done in a single job submission on a linux machine (this was run using 20 cores with 120 Gb RAM and takes two to three days).

The file "1_SNP_calling.txt" contains steps for calling SNPs from demultiplexed GBS data and filtering vcf files (these are broken out in "Steps".

The file "2_genome_analysis" contains steps for running NLR-Annotator and Mummer with Nucmer algorithm to generate synteny plots. This is not broken out in steps
Before start
###############
# SNP calling #
###############

To set up your working environment you will need to download taro data from NCBI and
set paths to each program if they are not already in your bash environment.

The SRA files will need to be demultiplexed

The DNA was digested with enzyme PstI; see file Taro_UH_report.pdf for details

The index keys files are:
TaroUH_key.txt.csv.C2YWBACXX_Hawaiian.samp_bar_enz.tab
TaroUH_key.txt.csv.C3008ACXX_P230xP255.samp_bar_enz.tab # this is the 1025 mapping population



#----------------------------------#
# Programs and versions used in pipeline
#----------------------------------#
bowtie2v2.2.4
samtools1.4.1
bcftools-1.2
vcftools-v0.1.14


#----------------------------------#
# DATA FILES
#----------------------------------#

# GENOTYPING INPUT FILES
# Taro GBS data are available from NCBI's Short Read Archive

# (1)
# SRX2754311 --> GBS data for Taro from Hawaii, South Pacific, Palau, and mainland Asia
# the original flow_cell is C2YWBACXX_6_fastq.gz

# (2)
# SRX2754311 --> GBS data for the '1025' Taro mapping population resistant to Taro Leaf Blight
# the original flow_cell is C3008ACXX_8_fastq.gz


# GENOME FILE
# Download Taro genome from NCBI
# Bioproject PRJNA567267
###############
# SNP calling #
###############

# To set up your working environment you will need to download GBS and genome data from NCBI
# Set paths to each program if they are not in your bash environment.

#----------------------------------#
# Programs and versions used for SNP calling
#----------------------------------#
# bowtie2v2.2.4
# samtools1.4.1
# bcftools-1.2

#----------------------------------#
# DATA FILES
#----------------------------------#

# Short-read files that have been demultiplexed (see Guidelines)
# SRX2754311 --> GBS data for Taro from Hawaii, South Pacific, Palau, and mainland Asia
# SRX2754311 --> GBS data for the '1025' Taro mapping population resistant to Taro Leaf Blight

# GENOME FILE
# NCBI Bioproject PRJNA567267
CALL SNPS
CALL SNPS
# copy the genome to a working folder and unzip it to be a "working copy"

cp /your/path/to/genome/genome.gz genome.fasta.gz
gunzip genome.fasta.gz

# index genome
samtools faidx genome.fasta
bowtie2-build genome.fasta genome
# Map demultiplexed GBS reads with bowtie2 or a program of your choice. Repeat this step for all demultiplexed samples
# settings are -q for fastq format, -p for 20 threads, --very-sensitive-local for alignment algorithm, -x for reference, -U name of demultiplexed read file (e.g., P230.R1.fastq). -S is output file "P230.sam"

bowtie2 -q -p 20 --very-sensitive-local -x genome -U P230.R1.fastq -S P230.sam
# Use samtools to convert *sam to *bam format, sort the reads, filter out unmapped reads (-F 4) to save space, and then check mapping stats
# this loop will process all *sam files in your working directory
# note some samtools programs use -o while others use > to collect program output

for i in *sam;
do
samtools view -bS -o $i.bam $i --threads 20
samtools sort $i.bam -o $i.sort.bam --threads 20
samtools view -h -F 4 $i.sort.bam > $i.mapped.bam
samtools flagstat $i.mapped.bam > $i.mapped.sort.stats
done;

#calculate genotype likelihoods and call variants
samtools mpileup -gu -f genome.fasta -q 20 -Q 20 --output-tags DP,AD,ADF,ADR,SP,INFO/AD *mapped.bam | bcftools call -m --output-type v -o draft_Taro.merged_q20Q20.vcf
# compress the *vcf to save space
gzip draft_Taro.merged_q20Q20.vcf

Filter SNPs in *vcf file
Filter SNPs in *vcf file
## I use variables for this script. Would be easier to run the file 1_SNP_calling.txt provided on page with Abstract

## This step remove samples that are not part of the mapping population

A=draft_Taro.merged_q20Q20.vcf.gz

vcftools --gzvcf $A --keep keep_1025_230_255.txt --recode --recode-INFO-all --out 1025_230_255
## Remove SNPs within 5 bases of an INDEL

B=1025_230_255.recode.vcf

bcftools filter -g 5 $B >filter.1.vcf
## Remove indels

C=filter.1.vcf

vcftools --vcf $C --keep keep_1025_230_255 --remove-indels --recode --recode-INFO-all --out 1025_230_255.noindel
## light filter for max missing 50%, minimum depth of 5 (per genotype), bi-allelic calls, minimum quality phred 20
## remove samples that are not the 1025 taro leaf blight mapping population

D=1025_230_255.noindel.recode.vcf

vcftools --vcf $D --max-missing 0.5 --minDP 5 --min-alleles 2 --max-alleles 2 --minQ 20 --recode --recode-INFO-all --out 1025_230_255.noindel.maxmiss.5.minDP5.2alleles.minQ20
## Measure missingness across individuals and create a file listing individuals missing >30% of data

E=1025_230_255.noindel.maxmiss.5.minDP5.2alleles.minQ20.recode.vcf

vcftools --vcf $E --missing-indv --out 30percent

cat 30percent.imiss | awk '{if($5>0.3)print $1}'|grep -v INDV> remove-indv_30perc
## Remove individuals missing 30% data in conjunction with the light filter

D=1025_230_255.noindel.recode.vcf

vcftools --vcf $D --remove remove-indv_30perc --max-missing 0.5 --minDP 5 --min-alleles 2 --max-alleles 2 --minQ 20 --recode --recode-INFO-all --out 1025_230_255.noindel.maxmiss.5.minDP5.2alleles.minQ20.n86ind
## Measure missingness across loci and create a file "exclude_miss20p". Apply this file in step 15.

F=1025_230_255.noindel.maxmiss.5.minDP5.2alleles.minQ20.n86ind.recode.vcf

vcftools --vcf $F --keep keep_1025_230_255 --missing-site --out 1025_230_255.noindel.maxmiss.5.minDP5.2alleles.minQ20

cat 1025_230_255.noindel.maxmiss.5.minDP5.2alleles.minQ20.lmiss|awk '{if($6>=0.20)print $1, $2}'>exclude_miss20p
## Apply stringent filter:
## Exclude individuals missing 30% of data and loci that are missing 20% of data
## some filters are redundant in case steps are run out of order
## apply minimum genotypic depth of 8

G=1025_230_255.noindel.recode.vcf

vcftools --vcf $G --remove remove-indv_30perc --exclude-positions exclude_miss20p --max-missing 0.8 --minDP 8 --min-alleles 2 --max-alleles 2 --minQ 20 --recode --recode-INFO-all --out 1025_230_255.noindel.maxmiss.8.minDP8.2alleles.minQ20.n86
## Apply minor allele frequency filter

vcftools --vcf 1025_230_255.noindel.maxmiss.8.minDP8.2alleles.minQ20.n86_variable_parental.recode.vcf.recode.vcf --maf 0.012 --recode --recode-INFO-all --out 1025_230_255.noindel.maxmiss.8.minDP8.2alleles.minQ20.n86_variable_parents_maf_0.012

# filtered file is: 1025_230_255.noindel.maxmiss.8.minDP8.2alleles.minQ20.n86_variable_parents_maf_0.012.recode.vcf