Nov 20, 2019

Public workspaceSteps to Create FASTQ of CCS Overlapping Control SSR - CCS ROI V.6

  • 1United States Department of Agriculture (USDA) Agricultural Research Service (ARS), US Meat Animal Research Center (USMARC), State Spur 18D, Clay Center, NE 68933, USA
  • USMARC Bacterial Systems Biology Working Group
Icon indicating open access to content
QR code linking to this content
Protocol CitationGregory P Harhay 2019. Steps to Create FASTQ of CCS Overlapping Control SSR - CCS ROI. protocols.io https://dx.doi.org/10.17504/protocols.io.9i6h4he
Manuscript citation:
Harhay, G.P., Harhay, D.M., Bono, J.L.et al.A Computational Method to Quantify the Effects of Slipped Strand Mispairing on Bacterial Tetranucleotide Repeats.Sci Rep9,18087 (2019). https://doi.org/10.1038/s41598-019-53866-z
License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License,  which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited
Protocol status: Working
We use this protocol and it's working
Created: November 20, 2019
Last Modified: November 20, 2019
Protocol Integer ID: 30014
Abstract
The virulence and pathogenicity of bacterial pathogens are related to their adaptability to changing environments. One process enabling adaptation is based on minor changes in genome sequence, as small as a few base pairs, within segments of genome called simple sequence repeats (SSRs) that consist of multiple copies of a short sequence (from one to several nucleotides), repeated in series.  SSRs are found in eukaryotes as well as prokaryotes, and variation in them occurs at frequencies up to a million-fold higher than the average bacterial mutation rate through a process of slipped stranded mispairing (SSM) by DNA polymerase during replication.  The characterization of SSR length by standard sequencing methods is complicated by the appearance of length variation introduced during the sequencing process that does not accurately quantify lower-abundance repeat number variants in a population. Here we report a computational approach to correct for process-induced artifacts, validated for tetranucleotide repeats by use of synthetic constructs of fixed, known length. We apply this method to a laboratory culture ofHistophilus somni, prepared from a single colony, and demonstrate that the culture consists of populations of distinct sequence phase and read length variants at individual tetranucleotide SSR loci.

In this protocol we validate the computaitonal approaches presented here by sequenciing chemically synthesized oligos and annealed to a form duplexes. These oligos are slightly shorter version of the AAGC SSR found in CP018802.1.

Protocol Workflow for Creating FASTQ of CCS ROI for Control SSR

Software Requirements
Software Requirements

Software
Bedtools
NAME
http://quinlanlab.org
DEVELOPER

Software
MatLab
NAME
For those without Matlab licences, Matlab code can be run in CodeOcean at this Matlab Compute Capsule


Software
Phobos
NAME
Dr. Christoph Mayer
DEVELOPER

Software
Geneious
NAME
Biomatters Ltd
DEVELOPER

Geneious was used to as wrapper for running sequence mappers, phobos, and sequence extraction

Workflow Summary
Workflow Summary




Acquire Annotated Control SSR Sequences and Genome For Mapping References
Acquire Annotated Control SSR Sequences and Genome For Mapping References
Download from GenBank manually, Geneious, or MatLab function


Run Phobos to Identify SSR and Define SSR target to Synthesize
Run Phobos to Identify SSR and Define SSR target to Synthesize
Run Phobos to identify simple sequence repeats; search for repeats 2-mers to 10-mers in CP018802.1 genome.This Geneious plugin does not provide access to all potentail running modes and defaults to providing repeat unit naming using "normalised alphabetical mode," where the repeat unit reported is independent of strand and phase enabling Phobos to chose the the repeat pattern that comes first in the alphabet.


Commodity chemical synthesis was restricted to oligos <= 100 bp, select SSR that is smaller than 100 bp for further analysis. 100 bp is the practical limit for reasonably priced chemically synthesized oligos.


Repeat UnitMinimumMaximumLengthPercentage PerfectionRepeat Class
AACC1,792,2171,792,466250100.000%tetranucleotide
AATC1,452,5621,452,715154100.000%tetranucleotide
ACTG1,501,3211,501,467147100.000%tetranucleotide
ACTG1,456,0131,456,119107100.000%tetranucleotide
AAGC1,834,0161,834,09479100.000%tetranucleotide

Becase its length was the largest tetranucleotide SSR below the 100 bp, the AAGC SSR locus was selected as the basis to synthesize the SSR control duplex.

Aggregate PacBIo Circular Consensus Sequence (CCS)
Aggregate PacBIo Circular Consensus Sequence (CCS)
Used circular consensus sequence CCS from PacBio RSII

Control SSR Libraries

Control SSR Libraries based on the 79 bp AAGC tetranucleotide SSR found spanning 1834016 - 1834094bp in CP018802. Each oligo had 14 bp 5' flanking region and 11 bp 3' flank, with the flanking regions identical to those found in CP018802. In total, six oligos were chemically synthesized and annealed into 3 duplexes, with 3 bp overhands on the 5'-end to faciltate PacBio CCS library creation.

88 bp duplex, 63 bp SSR, NM4 ( 4 AAGC repeat units removed from 79 bp genomic)
92 bp duplex , 67 bp SSR, NM3 ( 3 AAGC repeat units removed from 79 bp genomic)

96 bp duplex , 71 bp SSR, NM2 ( 2 AAGC repeat units removed from 79 bp genomic)




Map CCS to Control SSR
Map CCS to Control SSR
For Single Control Duplex - 88 bp total length containing a 63 bp AAGC SSR (NM4)

SSR Reference Mapping Parameters: Geneious Assembler, Medium Sensitiivity/Fast, iterate up to 5 times, Map multiple best matched randomly

Input File of Raw CCS is

Download Control_Single_Duplex_63bp_SSR_L_23088_raw.fastqControl_Single_Duplex_63bp_SSR_L_23088_raw.fastq





  • Export alignment to BAM file

Expected result
Download Control_Single_Duplex_63bp_SSR_L_23088_raw_map_AAGC_Nm4.bamControl_Single_Duplex_63bp_SSR_L_23088_raw_map_AAGC_Nm4.bam Download Control_Single_Duplex_63bp_SSR_L_23088_raw_map_AAGC_Nm4.bam.baiControl_Single_Duplex_63bp_SSR_L_23088_raw_map_AAGC_Nm4.bam.bai



For Three Control Duplexes - NM2 96 bp total, 71 bp SSR; NM3 92 bp total, 67 bp SSR; NM4 88 bp total, 63 bp SSR


Input: Download Control_Three_Duplexes_SSR_L_23089_raw.fastqControl_Three_Duplexes_SSR_L_23089_raw.fastq
Run Geneious Assembler with same parameters as with single duplex mapping job, using the NM3 sequence as the reference, to create the following BAM alignment files


Expected result
Download Control_Three_Duplexes_L_23089_raw_map_AAGC_Nm3.bam.baiControl_Three_Duplexes_L_23089_raw_map_AAGC_Nm3.bam.bai Download Control_Three_Duplexes_L_23089_raw_map_AAGC_Nm3.bamControl_Three_Duplexes_L_23089_raw_map_AAGC_Nm3.bam


Run bedtools to identify CCS spanning the control SSR
Run bedtools to identify CCS spanning the control SSR

For CCS from sequencing library consisting of single control duplex, 63 bp AAGC SSR (NM4)

Create BAM file of CCS overlapping SSR ROI using coordinates identified in step 4 and transferred to their respective BED file to be used in combination with with the BAM file. When specifiying position of SSR, allow for 5 bp on each flank. Please keep in mind the BED file convention, the left coodrinate is 0-based while the right coordinate is 1-based.


For selecting CCS mapping to SSR, use BED to define coordinates

Download CP018802_SSR_79bp_AAGC_Nm4.bedCP018802_SSR_79bp_AAGC_Nm4.bed and Control_Single_Duplex_63bp_SSR_L_23088_raw_map_AAGC_Nm4.bam (generated in previous step)

Command
Find CCS that completely overlap 63 bp AAGC SSR (NM4) including 5 bp adjacent non-SSR region on each flank
bedtools intersect -a Control_Single_Duplex_63bp_SSR_L_23088_raw_map_AAGC_Nm4.bam -b CP018802_SSR_79bp_AAGC_Nm4.bed  -F 1.0 -wa > Control_Single_Duplex_63bp_SSR_L_23088_raw_map_intersect_AAGC_Nm4.bam

Expected result
Download Control_Single_Duplex_63bp_SSR_L_23088_raw_map_intersect_AAGC_Nm4.bamControl_Single_Duplex_63bp_SSR_L_23088_raw_map_intersect_AAGC_Nm4.bam

For sequencing library consisting of three control duplexes, 63 bp AAGC SSR in (NM4) + 67 bp AAGC SSR (NM3) + 71 bp AAGC SSR (NM2)


Test to see if approach can recover the different length consitituents from mixture of SSR lengths present in sample.
For selecting CCS completely overlapping the NM3 67 bp SSR + flanking region, use bed to define coordinates.

Input Download CP018802_SSR_79bp_AAGC_Nm3.bedCP018802_SSR_79bp_AAGC_Nm3.bed
and

Control_Three_Duplexes_L_23089_raw_map_AAGC_Nm3.bam (generated in previous step)



Command
Find CCS that completely overlap 67 bp AAGC SSR (NM3) including 5 bp adjacent non-SSR region on each flank
bedtools intersect -a Control_Three_Duplexes_L_23089_raw_map_AAGC_Nm3.bam  -b CP018802_SSR_79bp_AAGC_Nm3.bed  -F 1.0 -wa > Control_Three_Duplexes_L_23089_raw_map_intersect_AAGC_Nm3.bam

Expected result
Download Control_Three_Duplexes_L_23089_map_intersect_AAGC_Nm3.bamControl_Three_Duplexes_L_23089_map_intersect_AAGC_Nm3.bam



Analyze CCS Spanning AAGC SSR and Remove Flanking Sequence
Analyze CCS Spanning AAGC SSR and Remove Flanking Sequence

For CCS sequencing library derived from single spanning 63 bp AAGC SSR (NM4)

  • Use Geneious to view CCS mapping to reference
  • Inspect alignment of CCS mapping to duplex. Note that gap regions between end of SSR region and first adjacent base of both flanking regions defined region of interest (ROI).
  • Some mappers such as BowTie2 tend to place "extra" repeat units in the gap region between the SSR and the first adjacent base to the left of the SSR, while Genious mapper tends to place "extra" repeat units to the right of the SSR, in the gap between the SSR and the first adjacent base
  • For each CCS the Geneious "Extract" function was used to select bases within the ROI to create a new FASTQ file of CCS with bases spanning the ROI.



  • Write the portion of each CCS within the ROI to FASTQ file Download Control_Single_Duplex_63bp_SSR_ROI_NM4.fastqControl_Single_Duplex_63bp_SSR_ROI_NM4.fastq

Similar analysis performed for sequencing library consisting of three control duplexes, 63 bp AAGC SSR in (NM4) + 67 bp AAGC SSR (NM3) + 71 bp AAGC SSR (NM2) mapping to NM3

  • Write the portion of each CCS within the ROI (NM3) to FASTQ file Download Control_Three_Duplexes_L_23089_SSR_ROI_Nm3.fastqControl_Three_Duplexes_L_23089_SSR_ROI_Nm3.fastq




Move CCS within ROI FASTQ to Matlab Compute environment
Move CCS within ROI FASTQ to Matlab Compute environment
Use either Matlab Compute Capsule at Code Ocean or use local envionment. The steps outline below for the single NM4 control duplex with a 63 bp SSR.