Steps to Create FASTQ of CCS Overlapping Control SSR - CCS ROI

Gregory Harhay

Nov 20, 2019

Version 6

Steps to Create FASTQ of CCS Overlapping Control SSR - CCS ROI V.6

DOI

dx.doi.org/10.17504/protocols.io.9i6h4he

Gregory P Harhay¹

¹United States Department of Agriculture (USDA) Agricultural Research Service (ARS), US Meat Animal Research Center (USMARC), State Spur 18D, Clay Center, NE 68933, USA

USMARC Bacterial Systems Biology Working Group

Gregory P Harhay

United States Department of Agriculture (USDA) Agricultural ...

DOI: dx.doi.org/10.17504/protocols.io.9i6h4he

Protocol Citation: Gregory P Harhay 2019. Steps to Create FASTQ of CCS Overlapping Control SSR - CCS ROI. protocols.io https://dx.doi.org/10.17504/protocols.io.9i6h4he

Manuscript citation:

Harhay, G.P., Harhay, D.M., Bono, J.L.et al.A Computational Method to Quantify the Effects of Slipped Strand Mispairing on Bacterial Tetranucleotide Repeats.Sci Rep9,18087 (2019). https://doi.org/10.1038/s41598-019-53866-z

License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited

Protocol status: Working

We use this protocol and it's working

Created: November 20, 2019

Last Modified: November 20, 2019

Protocol Integer ID: 30014

Abstract

The virulence and pathogenicity of bacterial pathogens are related to their adaptability to changing environments. One process enabling adaptation is based on minor changes in genome sequence, as small as a few base pairs, within segments of genome called simple sequence repeats (SSRs) that consist of multiple copies of a short sequence (from one to several nucleotides), repeated in series.  SSRs are found in eukaryotes as well as prokaryotes, and variation in them occurs at frequencies up to a million-fold higher than the average bacterial mutation rate through a process of slipped stranded mispairing (SSM) by DNA polymerase during replication.  The characterization of SSR length by standard sequencing methods is complicated by the appearance of length variation introduced during the sequencing process that does not accurately quantify lower-abundance repeat number variants in a population.  Here we report a computational approach to correct for process-induced artifacts, validated for tetranucleotide repeats by use of synthetic constructs of fixed, known length.  We apply this method to a laboratory culture ofHistophilus somni, prepared from a single colony, and demonstrate that the culture consists of populations of distinct sequence phase and read length variants at individual tetranucleotide SSR loci.

In this protocol we validate the computaitonal approaches presented here by sequenciing  chemically synthesized oligos and annealed to a form duplexes. These oligos are slightly shorter version of the AAGC SSR found in CP018802.1. 

Protocol Workflow for Creating FASTQ of CCS ROI for Control SSR

Software Requirements

Software
Bedtools
NAME
http://quinlanlab.org
DEVELOPER
https://github.com/arq5x/bedtools2
SOURCE LINK

Software
MatLab
NAME
For those without Matlab licences, Matlab code can be run in CodeOcean at this  Matlab  Compute Capsule


Software
Phobos
NAME
Dr. Christoph Mayer
DEVELOPER

Software
Geneious
NAME
Biomatters Ltd
DEVELOPER

Geneious was used to as wrapper for running sequence mappers, phobos, and sequence extraction

Workflow Summary

Acquire Annotated Control SSR Sequences and Genome For Mapping References

Download from GenBank manually, Geneious,  or MatLab function

https://www.ncbi.nlm.nih.gov/nuccore/CP018802

Run Phobos to Identify SSR and Define SSR target to Synthesize

Run Phobos  to identify simple sequence repeats;  search for repeats  2-mers to 10-mers in CP018802.1 genome.This Geneious plugin does not provide access to all potentail running modes and defaults to providing repeat unit naming using "normalised alphabetical mode,"  where the repeat unit reported is independent of strand and phase enabling  Phobos to chose the the repeat pattern that comes first in the alphabet.

Commodity chemical synthesis was restricted to oligos <= 100 bp, select SSR that is smaller than 100 bp for further analysis.  100 bp is the practical limit for reasonably priced chemically synthesized oligos.


Repeat UnitMinimumMaximumLengthPercentage PerfectionRepeat Class
AACC1,792,2171,792,466250100.000%tetranucleotide
AATC1,452,5621,452,715154100.000%tetranucleotide
ACTG1,501,3211,501,467147100.000%tetranucleotide
ACTG1,456,0131,456,119107100.000%tetranucleotide
AAGC1,834,0161,834,09479100.000%tetranucleotide

Becase its length was the largest tetranucleotide SSR below the 100 bp,  the AAGC SSR locus was selected as the basis to synthesize the SSR control duplex. 

Aggregate PacBIo Circular Consensus Sequence (CCS)

Used circular consensus sequence CCS from PacBio RSII 

Control SSR Libraries

Control SSR Libraries based on the 79 bp AAGC tetranucleotide SSR found spanning 1834016 - 1834094bp in  CP018802.  Each oligo had 14 bp 5' flanking region and 11 bp 3' flank, with the flanking regions identical to those found in CP018802.  In total, six oligos were chemically synthesized and annealed into 3 duplexes, with 3 bp overhands on the 5'-end to faciltate PacBio CCS library creation. 
  
88 bp duplex, 63 bp SSR, NM4 ( 4 AAGC repeat units removed from 79 bp genomic)

92 bp duplex , 67 bp SSR, NM3 ( 3 AAGC repeat units removed from 79 bp genomic)

96 bp duplex , 71 bp SSR, NM2  ( 2 AAGC repeat units removed from 79 bp genomic)

Map CCS to Control SSR

For Single Control Duplex  - 88 bp total length containing a 63 bp AAGC SSR (NM4)

SSR Reference Mapping Parameters: Geneious Assembler,  Medium Sensitiivity/Fast, iterate up to 5 times, Map multiple best matched randomly

Input File of Raw CCS is

Control_Single_Duplex_63bp_SSR_L_23088_raw.fastq 

Export alignment to BAM file

Expected result
Control_Single_Duplex_63bp_SSR_L_23088_raw_map_AAGC_Nm4.bam Control_Single_Duplex_63bp_SSR_L_23088_raw_map_AAGC_Nm4.bam.bai 

For Three Control Duplexes - NM2 96 bp total, 71 bp SSR; NM3 92 bp total, 67 bp SSR; NM4 88 bp total, 63 bp SSR

Input: Control_Three_Duplexes_SSR_L_23089_raw.fastq 
Run Geneious Assembler with same parameters as with single duplex mapping job, using the NM3 sequence as the reference,  to create the following BAM  alignment files

Expected result
Control_Three_Duplexes_L_23089_raw_map_AAGC_Nm3.bam.bai Control_Three_Duplexes_L_23089_raw_map_AAGC_Nm3.bam

Run bedtools to identify CCS spanning the control SSR

For CCS from sequencing library consisting of single control duplex,  63 bp AAGC SSR (NM4)

Create BAM file of CCS overlapping SSR ROI using coordinates identified in step 4 and transferred to their respective BED file to be used in combination with with the BAM file.  When specifiying position of SSR, allow for 5 bp on each flank. Please keep in mind the BED file convention, the left coodrinate is 0-based while the right coordinate is 1-based.  

For selecting CCS mapping to SSR,  use BED to define coordinates

CP018802_SSR_79bp_AAGC_Nm4.bed   and  Control_Single_Duplex_63bp_SSR_L_23088_raw_map_AAGC_Nm4.bam (generated in previous step)

Command
Find CCS that completely overlap 63 bp AAGC SSR (NM4) including 5 bp adjacent non-SSR region on each flank
bedtools intersect -a Control_Single_Duplex_63bp_SSR_L_23088_raw_map_AAGC_Nm4.bam -b CP018802_SSR_79bp_AAGC_Nm4.bed  -F 1.0 -wa > Control_Single_Duplex_63bp_SSR_L_23088_raw_map_intersect_AAGC_Nm4.bam

Expected result
Control_Single_Duplex_63bp_SSR_L_23088_raw_map_intersect_AAGC_Nm4.bam 

For sequencing library consisting of  three control duplexes,  63 bp AAGC SSR in (NM4) + 67 bp  AAGC SSR (NM3) +  71 bp AAGC SSR (NM2)

Test to see if approach can recover the different length consitituents from mixture of SSR lengths present in sample.
For selecting CCS completely overlapping the NM3 67 bp SSR + flanking region, use bed to define coordinates.

Input CP018802_SSR_79bp_AAGC_Nm3.bed  
and 

Control_Three_Duplexes_L_23089_raw_map_AAGC_Nm3.bam (generated in previous step)

Command
Find CCS that completely overlap 67 bp AAGC SSR (NM3) including 5 bp adjacent non-SSR region on each flank
bedtools intersect -a Control_Three_Duplexes_L_23089_raw_map_AAGC_Nm3.bam  -b CP018802_SSR_79bp_AAGC_Nm3.bed  -F 1.0 -wa > Control_Three_Duplexes_L_23089_raw_map_intersect_AAGC_Nm3.bam

Expected result
Control_Three_Duplexes_L_23089_map_intersect_AAGC_Nm3.bam

Analyze CCS Spanning AAGC SSR and Remove Flanking Sequence

For CCS sequencing library derived from single spanning 63 bp AAGC SSR (NM4) 

Use Geneious to view CCS mapping to reference
Inspect alignment  of CCS  mapping to duplex.  Note that gap regions between end of SSR region and first adjacent base of both flanking regions defined region of interest (ROI). 
Some mappers such as BowTie2 tend to place "extra" repeat units in the gap region between the SSR and the first adjacent base to the left of the SSR, while Genious mapper tends to place "extra" repeat units to the right of the SSR, in the gap between the SSR and the first adjacent base
For each CCS the Geneious "Extract" function was used to select bases within the ROI to create a new FASTQ file of CCS with bases spanning the ROI.

Write the portion of each CCS within the ROI to FASTQ file  Control_Single_Duplex_63bp_SSR_ROI_NM4.fastq 

Similar analysis performed for sequencing library consisting of  three control duplexes,  63 bp AAGC SSR in (NM4) + 67 bp  AAGC SSR (NM3) +  71 bp AAGC SSR (NM2) mapping to NM3

Write the portion of each CCS within the ROI (NM3) to FASTQ file  Control_Three_Duplexes_L_23089_SSR_ROI_Nm3.fastq

Move CCS within ROI FASTQ to Matlab Compute environment

Use either Matlab Compute Capsule at Code Ocean or use local envionment. The steps outline below for the single NM4 control duplex with a 63 bp SSR.

Repeat Unit	Minimum	Maximum	Length	Percentage Perfection	Repeat Class
AACC	1,792,217	1,792,466	250	100.000%	tetranucleotide
AATC	1,452,562	1,452,715	154	100.000%	tetranucleotide
ACTG	1,501,321	1,501,467	147	100.000%	tetranucleotide
ACTG	1,456,013	1,456,119	107	100.000%	tetranucleotide
AAGC	1,834,016	1,834,094	79	100.000%	tetranucleotide

Public workspaceSteps to Create FASTQ of CCS Overlapping Control SSR - CCS ROI V.6

Steps to Create FASTQ of CCS Overlapping Control SSR - CCS ROI V.6