United States Department of Agriculture (USDA) Agricultural Research Service (ARS), US Meat Animal Research Center (USMARC), State Spur 18D, Clay Center, NE 68933, USA
Harhay, G.P., Harhay, D.M., Bono, J.L. et al. A Computational Method to Quantify the Effects of Slipped Strand Mispairing on Bacterial Tetranucleotide Repeats. Sci Rep 9, 18087 (2019). https://doi.org/10.1038/s41598-019-53866-z
License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited
Protocol status: Working
We use this protocol and it's working
Created: November 20, 2019
Last Modified: November 20, 2019
Protocol Integer ID: 30014
Abstract
The virulence and pathogenicity of bacterial pathogens are related to their adaptability to changing environments. One process enabling adaptation is based on minor changes in genome sequence, as small as a few base pairs, within segments of genome called simple sequence repeats (SSRs) that consist of multiple copies of a short sequence (from one to several nucleotides), repeated in series. SSRs are found in eukaryotes as well as prokaryotes, and variation in them occurs at frequencies up to a million-fold higher than the average bacterial mutation rate through a process of slipped strand mispairing (SSM) by DNA polymerase during replication. The characterization of SSR length by standard sequencing methods is complicated by the appearance of length variation introduced during the sequencing process that does not accurately quantify lower-abundance repeat number variants in a population. Here we report a computational approach to correct for process-induced artifacts, validated for tetranucleotide repeats by use of synthetic constructs of fixed, known length. We apply this method to a laboratory culture of Histophilus somni, prepared from a single colony, and demonstrate that the culture consists of populations of distinct sequence phase and read length variants at individual tetranucleotide SSR loci.
In this protocol we validate the computational approaches presented here by sequencing chemically synthesized oligos annealed to form duplexes. These oligos are slightly shorter versions of the AAGC SSR found in CP018802.1.
Protocol Workflow for Creating FASTQ of CCS ROI for Control SSR
Run Phobos to Identify SSR and Define SSR target to Synthesize
Run Phobos to identify simple sequence repeats; search for repeats with unit sizes from 2-mers to 10-mers in the CP018802.1 genome. This Geneious plugin does not provide access to all potential running modes and defaults to naming repeat units in "normalised alphabetical mode," where the reported repeat unit is independent of strand and phase, so Phobos chooses the repeat pattern that comes first in the alphabet.
Because commodity chemical synthesis is restricted to oligos <= 100 bp, the practical limit for reasonably priced chemically synthesized oligos, select an SSR shorter than 100 bp for further analysis.
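The naming rule described above can be illustrated with a minimal Python sketch. It assumes that "independent of strand and phase" means taking the alphabetically first string among all rotations of the unit and of its reverse complement; the function names are hypothetical and this is not Phobos code.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[base] for base in reversed(seq))


def rotations(unit):
    """All cyclic rotations (phases) of a repeat unit."""
    return [unit[i:] + unit[:i] for i in range(len(unit))]


def normalised_unit(unit):
    """Alphabetically first unit over both strands and all phases (assumed rule)."""
    unit = unit.upper()
    candidates = rotations(unit) + rotations(reverse_complement(unit))
    return min(candidates)


# Different phases and strands of the same repeat collapse to one name.
print(normalised_unit("GCAA"))  # AAGC
print(normalised_unit("GCTT"))  # AAGC
```

Under this assumption, a repeat read on either strand or starting at any phase is reported under a single unit name, which is why the table lists units such as AAGC and ACTG.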
| Repeat Unit | Minimum | Maximum | Length (bp) | Percentage Perfection | Repeat Class |
| AACC | 1,792,217 | 1,792,466 | 250 | 100.000% | tetranucleotide |
| AATC | 1,452,562 | 1,452,715 | 154 | 100.000% | tetranucleotide |
| ACTG | 1,501,321 | 1,501,467 | 147 | 100.000% | tetranucleotide |
| ACTG | 1,456,013 | 1,456,119 | 107 | 100.000% | tetranucleotide |
| AAGC | 1,834,016 | 1,834,094 | 79 | 100.000% | tetranucleotide |
Because it was the longest tetranucleotide SSR below 100 bp, the AAGC SSR locus was selected as the basis for synthesizing the SSR control duplex.
Circular consensus sequences (CCS) from the PacBio RSII were used.
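The selection can be reproduced with a short Python sketch. The coordinates are copied from the Phobos results above and are treated as 1-based and inclusive; the variable names and the 100 bp constant are illustrative.

```python
# (repeat unit, minimum, maximum) from the Phobos results; 1-based, inclusive.
ssrs = [
    ("AACC", 1_792_217, 1_792_466),
    ("AATC", 1_452_562, 1_452_715),
    ("ACTG", 1_501_321, 1_501_467),
    ("ACTG", 1_456_013, 1_456_119),
    ("AAGC", 1_834_016, 1_834_094),
]
MAX_OLIGO_BP = 100  # practical limit for commodity oligo synthesis


def ssr_length(start, end):
    """Length of a 1-based, inclusive interval."""
    return end - start + 1


# Keep SSRs shorter than the synthesis limit, then take the longest one.
candidates = [s for s in ssrs if ssr_length(s[1], s[2]) < MAX_OLIGO_BP]
selected = max(candidates, key=lambda s: ssr_length(s[1], s[2]))
print(selected[0], ssr_length(selected[1], selected[2]))  # AAGC 79
```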
Control SSR Libraries
The control SSR libraries are based on the 79 bp AAGC tetranucleotide SSR spanning 1,834,016-1,834,094 bp in CP018802. Each oligo had a 14 bp 5' flanking region and an 11 bp 3' flanking region, identical to those found in CP018802. In total, six oligos were chemically synthesized and annealed into 3 duplexes, with 3 bp overhangs on the 5'-end to facilitate PacBio CCS library creation.
88 bp duplex, 63 bp SSR, NM4 (4 AAGC repeat units removed from 79 bp genomic)
92 bp duplex, 67 bp SSR, NM3 (3 AAGC repeat units removed from 79 bp genomic)
96 bp duplex, 71 bp SSR, NM2 (2 AAGC repeat units removed from 79 bp genomic)
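The three control lengths follow from simple arithmetic, sketched below in Python. The sketch assumes the stated duplex length is the SSR plus the 14 bp and 11 bp flanks, which matches the numbers listed above; the function name is hypothetical.

```python
UNIT = "AAGC"
GENOMIC_SSR_BP = 79   # AAGC SSR length in CP018802
FLANK_5P_BP = 14      # 5' flanking region
FLANK_3P_BP = 11      # 3' flanking region


def control_lengths(units_removed):
    """Return (SSR length, duplex length) after removing whole repeat units."""
    ssr_bp = GENOMIC_SSR_BP - units_removed * len(UNIT)
    duplex_bp = FLANK_5P_BP + ssr_bp + FLANK_3P_BP
    return ssr_bp, duplex_bp


for name, removed in [("NM4", 4), ("NM3", 3), ("NM2", 2)]:
    ssr_bp, duplex_bp = control_lengths(removed)
    print(name, ssr_bp, duplex_bp)
# NM4 63 88
# NM3 67 92
# NM2 71 96
```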
Map CCS to Control SSR
For Single Control Duplex - 88 bp total length containing a 63 bp AAGC SSR (NM4)
SSR Reference Mapping Parameters: Geneious Assembler, Medium Sensitivity/Fast, iterate up to 5 times, map multiple best matches randomly
Run the Geneious Assembler with the same parameters as the single-duplex mapping job, using the NM3 sequence as the reference, to create the following BAM alignment files
Run bedtools to identify CCS spanning the control SSR
For CCS from sequencing library consisting of single control duplex, 63 bp AAGC SSR (NM4)
Create a BAM file of CCS overlapping the SSR ROI, using the coordinates identified in step 4 and transferred to their respective BED file, which is used in combination with the BAM file. When specifying the position of the SSR, allow for 5 bp on each flank. Keep in mind the BED file convention: the left coordinate is 0-based while the right coordinate is 1-based.
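The coordinate conversion can be sketched in Python as follows. The reference name `NM4_control` and the SSR position 15-77 are hypothetical values chosen for illustration (a 63 bp SSR placed after a 14 bp 5' flank); the actual coordinates come from the mapping inspected in Geneious.

```python
def ssr_to_bed(chrom, ssr_start_1based, ssr_end_1based, flank_bp=5):
    """Convert a 1-based, inclusive SSR interval to a BED line with flanks.

    BED start is 0-based, so subtract 1 from the 1-based start; BED end is
    exclusive, which equals the 1-based inclusive end.
    """
    bed_start = max(0, ssr_start_1based - 1 - flank_bp)
    bed_end = ssr_end_1based + flank_bp
    return f"{chrom}\t{bed_start}\t{bed_end}"


# Hypothetical example: SSR at 1-based positions 15..77 of an 88 bp reference.
print(ssr_to_bed("NM4_control", 15, 77))  # NM4_control	9	82
```

The resulting interval covers 82 - 9 = 73 bp, that is, the 63 bp SSR plus 5 bp on each side.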
For selecting CCS mapping to SSR, use BED to define coordinates
CP018802_SSR_79bp_AAGC_Nm4.bed and Control_Single_Duplex_63bp_SSR_L_23088_raw_map_AAGC_Nm4.bam (generated in previous step)
Command
Find CCS that completely overlap 63 bp AAGC SSR (NM4) including 5 bp adjacent non-SSR region on each flank
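The "completely overlap" criterion amounts to an interval containment test, sketched below in Python with half-open, 0-based coordinates. The ROI and alignment coordinates are hypothetical. With bedtools, a comparable filter is commonly expressed through `intersect` with the minimum overlap, as a fraction of the BED interval, set to 1.0 (the `-F 1.0` option); the exact command used in this protocol is not reproduced here.

```python
def spans_roi(aln_start, aln_end, roi_start, roi_end):
    """True if the alignment [aln_start, aln_end) covers the whole ROI."""
    return aln_start <= roi_start and aln_end >= roi_end


# Hypothetical ROI (SSR plus 5 bp flanks) and CCS alignment intervals.
roi = (9, 82)
alignments = {
    "ccs_1": (0, 88),   # spans the full reference
    "ccs_2": (20, 88),  # starts inside the ROI, so it is rejected
    "ccs_3": (9, 82),   # covers exactly the ROI
}

spanning = [name for name, (s, e) in alignments.items() if spans_roi(s, e, *roi)]
print(spanning)  # ['ccs_1', 'ccs_3']
```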
Analyze CCS Spanning AAGC SSR and Remove Flanking Sequence
For the CCS sequencing library derived from the single control duplex with a 63 bp AAGC SSR (NM4)
Use Geneious to view CCS mapping to reference
Inspect the alignment of CCS mapping to the duplex. Note that the region of interest (ROI) includes the gap regions between each end of the SSR region and the first adjacent base of the corresponding flanking region.
Some mappers such as Bowtie2 tend to place "extra" repeat units in the gap region between the SSR and the first adjacent base to the left of the SSR, while the Geneious mapper tends to place "extra" repeat units to the right of the SSR, in the gap between the SSR and the first adjacent base.
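The reason mappers can differ in where they place an extra repeat unit is that every placement yields the same read sequence, so the alignments are equally good. A minimal Python sketch, using hypothetical flanking sequences and a shortened four-unit SSR:

```python
UNIT = "AAGC"
# Hypothetical flanks, for illustration only (not the CP018802 flanks).
left_flank, right_flank = "TTGACCAT", "GGATCCTA"
reference_ssr = UNIT * 4


def insert_unit(ssr, unit_index):
    """Insert one extra repeat unit at the given unit boundary of the SSR."""
    pos = unit_index * len(UNIT)
    return ssr[:pos] + UNIT + ssr[pos:]


# Insert the extra unit at the left end, each internal boundary, and the right end.
reads = {left_flank + insert_unit(reference_ssr, i) + right_flank for i in range(5)}
print(len(reads))  # 1: all placements give the identical read
```

Because the placements are indistinguishable at the sequence level, defining the ROI to include the gap regions on both sides captures the extra units regardless of which mapper produced the alignment.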
For each CCS the Geneious "Extract" function was used to select bases within the ROI to create a new FASTQ file of CCS with bases spanning the ROI.
Write the portion of each CCS within the ROI to FASTQ file Control_Single_Duplex_63bp_SSR_ROI_NM4.fastq
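For readers working outside Geneious, the extraction can be approximated with a CIGAR walk. The Python sketch below is not the Geneious "Extract" function; the toy reference, read, CIGAR operations, and function names are all hypothetical. It keeps read bases aligned inside a half-open reference ROI, including insertions at the ROI boundaries, so extra repeat units placed on either side are retained.

```python
def extract_roi(seq, qual, ref_start, cigar, roi_start, roi_end):
    """Return (bases, qualities) of a read within the reference ROI.

    cigar is a list of (op, length) pairs; ref_start and the ROI
    [roi_start, roi_end) are 0-based reference coordinates.
    """
    ref_pos, read_pos = ref_start, 0
    out_seq, out_qual = [], []
    for op, length in cigar:
        if op in "M=X":  # consumes both reference and read
            for _ in range(length):
                if roi_start <= ref_pos < roi_end:
                    out_seq.append(seq[read_pos])
                    out_qual.append(qual[read_pos])
                ref_pos += 1
                read_pos += 1
        elif op == "I":  # consumes read only; keep if inside or at ROI edges
            if roi_start <= ref_pos <= roi_end:
                out_seq.append(seq[read_pos:read_pos + length])
                out_qual.append(qual[read_pos:read_pos + length])
            read_pos += length
        elif op == "S":  # soft clip: consumes read only
            read_pos += length
        elif op in "DN":  # deletion or skip: consumes reference only
            ref_pos += length
    return "".join(out_seq), "".join(out_qual)


def fastq_record(name, seq, qual):
    """Format one four-line FASTQ record."""
    return f"@{name}\n{seq}\n+\n{qual}\n"


# Toy reference: ACGT + AAGCAAGC + TTTT, with the ROI covering the SSR [4, 12).
read = "ACGT" + "AAGC" * 3 + "TTTT"  # one extra repeat unit
qual = "I" * len(read)
right = extract_roi(read, qual, 0, [("M", 12), ("I", 4), ("M", 4)], 4, 12)
left = extract_roi(read, qual, 0, [("M", 4), ("I", 4), ("M", 12)], 4, 12)
print(right[0], left[0])  # AAGCAAGCAAGC AAGCAAGCAAGC
print(fastq_record("ccs_1", *right), end="")
```

Both the left-placed and right-placed insertion give the same 12-base ROI sequence, which is the property the ROI definition relies on.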
A similar analysis was performed for the sequencing library consisting of three control duplexes, 63 bp AAGC SSR (NM4) + 67 bp AAGC SSR (NM3) + 71 bp AAGC SSR (NM2), mapped to NM3
Write the portion of each CCS within the ROI (NM3) to FASTQ file Control_Three_Duplexes_L_23089_SSR_ROI_Nm3.fastq
Move CCS within ROI FASTQ to Matlab Compute environment
Use either the Matlab Compute Capsule at Code Ocean or a local environment. The steps outlined below are for the single NM4 control duplex with a 63 bp SSR.