License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited
Protocol status: Working
We use this protocol and it's working
Created: July 28, 2022
Last Modified: May 08, 2023
Protocol Integer ID: 67838
Keywords: lower genetic diversity, genetic diversity, substantial genetic distance, endangered banff springs snail, indicative of unique gene pool, genomic variation, functional genomic diversity, genetic clusters among population, future studies on the functional genomic diversity, diversity for effective conservation management, unique gene pool, conservation of biodiversity, patterns of cryptic species, endangered species, species, genetic variation, determining cryptic species, biodiversity, genetic cluster, microgeographic population genomic structure suggestive, species status, common physella gyrina, genetic drift within environment, genome sequencing, population, genetic drift, utility of genomic, cryptic species, genomic, risk species, whole genome sequencing, genome, population structure, single nucleotide polymorphism, using ecology, effective conservation management, specialized thermal springs in banff national park, sequencing, diversity, gene, physella johnsoni, conservation
Abstract
Determining cryptic species and diversity in at-risk species is necessary for the understanding and conservation of biodiversity. The endangered Banff Springs Snail, Physella johnsoni, inhabits seven highly specialized thermal springs in Banff National Park, Alberta, Canada. However, it has been difficult to reconcile its species status to the much more common Physella gyrina using ecology, morphology and genetics. Here we used pooled whole-genome sequencing to characterize genomic variation and structure among five populations of P. johnsoni and three geographical proximate P. gyrina populations. By comparing over two million single nucleotide polymorphisms, we detected substantial genetic distance (pairwise FST of 0.27 to 0.44) between P. johnsoni and P. gyrina, indicative of unique gene pools. Genetic clusters among populations were found for both species, with up to 10% for P. johnsoniand 30% for P. gyrina of genetic variation being explained by population structure. P. johnsoni was found to have lower genetic diversity compared to P. gyrina, however, no patterns of were observed between genetic diversity and population minimums. Our results confirm that designation of P. johnsoni as an endangered species is warranted and that both P. johnsoniand P. gyrina exhibit microgeographic population genomic structure suggestive of rapid local adaptation and/or genetic drift within environments. This study showcases the utility of genomics to resolve patterns of cryptic species and diversity for effective conservation management. Future studies on the functional genomic diversity of P. johnsoni populations are needed to test for the possible role of selection within this thermal spring environment.
Project information and background
Project information
Below is the pipeline I used to analyze my pool-seq data (Stanford 2019). For my project, I analyzed five populations of Physella johnsoni and three populations of Physella gyrina of 20 to 40 individuals per population. These were sequenced by Genome Quebec on the Illumina HiSeq X, aiming for about 40x coverage (determined to be almost double that in actuality). Each population consisted of half a lane worth of data (total of four lanes). I included the SLURMs because it gives an idea of how long each thing took to run for my files.
Helpful SLURM commands
sbatch SLURM_name #will submit job to queue
squeue -u your_user_name #will give you the job's status
scancel job_ID_number #will cancel the job
acct -j job_ID_number --format=JobID,JobName,MaxRSS,Elapsed #Gives JobID (kind of redundant), the name of the job, the memory and the time it took.
Trimming and cleaning reads
Convert BCL to Fastq
Can skip this step if sequencing company has returned fastq files and you are happy with the barcode mismatch parameters they used.
Trimmomatic - cleaning and filtering low-quality reads and removing adaptors
phred : can be 64. Depends on your sequencing.
threads : change to the number of threads you have available on your cluster. We used 16, half of a node, but I think we could have used more. Just make sure to match this to what you are asking for in your SLURM.
ILLUMINACLIP : path to the adaptor file you made.
2 : seed mismatches
30 : is how accurate the match between the two adaptor ligated reads must be
10 : SimpleClip Threshold
CROP : we were having an issue in some of our reads where Trimmomatic just wasn't detecting the repetitive sequences in the last 15 nucleotides. Hence the hard crop to trim the last 15 nucleotides off each read.
LEADING : trim the leading nucleotides if they fall under Q5
TRAILING : trim the trailing nucleotides if they fall under Q5
SLIDINGWINDOW :
5:20 : look at 5 bases at a time and trim if the average Q is less than 20
Contamination removal (skip if high quality reference genome)
The steps below allow you to filter your sequences against databases of different potential contaminants. I didn't find it removed many sequences. I made the databases a variety of ways as I found out one way wouldn’t work for every group of organisms.
Databases for Archaea and Algae (Charophyceae, Chlorophyta, Cryptophyta, Eustigmatophyceae and Klebsormidiophyceae)
Create a database from NCBI of the sequences you would like to remove. You need to get the GI list
from NCBI (as per http://johnstantongeddes.org/aptranscriptome/2013/12/31/notes.html or https://www.biostars.org/p/6528/). You will need to install the newest version of NCBI BLAST+ (https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download).
I have heard that moving forward NCBI will be switching to using taxid to create databases rather than GI list but upon publishing this GI list were still being used.
In this example, I am using Archaea. Key thing is that the nt database and your GI list must exist
in the same directory. It will be messy, which is why I put the “X” in front of the files I was making so
that all of them were at the bottom.
/path/to/ncbi_directory/ncbi-blast-2.7.1+/bin/blastdb_aliastool -db nt -gilist Archaea.gi
-dbtype nucl -out X_nt_archaea -title "database for Archaea"
Once you’ve created the .nal file, you need to convert the database to .fasta file. It’s small so I ran it in the SLURM.
I downloaded all of the full bacterial genomes off of NCBI (I think there was about 10,000 of them?) and all assembly levels for bacteria that have been found in the thermal springs onto a computer (many Gb). I then used Globus to put them on to Cedar. https://www.ncbi.nlm.nih.gov/assembly/?term=bacteria. You will need to unzip any files that are zipped (including your query sequences) because DeconSeq can’t use zipped files.
Can use:
for file in *.gz #loop through all files with this file extension
do
gunzip $file #unzip them
done #when it's finished, stop.
This will had to be done for all of the bacterial genomes. It took 10+ hours so I would consider submitting it as a SLURM.
Then I concatenated the bacterial genomes together, using:
For the bacterial database (because it is over 200Gb) you will need to first split it into managable chunks before BWA can use them. Can use fasta splitter (http://kirill-kryukov.com/study/tools/fasta-splitter/).
The files need to be under 3Gb each, so split it to as many chunks as you need. I did 100.
You will have to go into the installed DeconSeq directory and set up the configure file DeconSeqConfig.pm. You will just need to change the database directory and out directory and the what files are accessed in the “use constant DBS”.
Split the query sequences
DeconSeq is unable to handle big datasets (e.g. 50+ Gb files that I was using)
Will need to use fastq splitter (http://kirill-kryukov.com/study/tools/fastq-splitter/)
I don’t think I ran this in a SLURM and just used the console. If it fails. . . put it in a SLURM.
For DeconSeq you need to choose the identity (-i) which is the percent match between your query sequence and the database and the coverage (-c) which is the amount of the sequence aligns.
I went with the parameters they used in their paper and based on what I had seen other people do, which was 94% identity (-i 94) and 90% to 95% coverage (-c 90 or -c 95).
DeconSeq was never designed for paired-end sequencing. Therefore it will process each read direction separately and this causes sequences to be removed in one direction and not the other.
Firstly, concatenate the 50 (or more or less depending on what you split your in files to) of clean files. I also did this for the .cont files because I wanted to keep them and see how many were removed.
Then you can use this person’s script found at: https://github.com/linsalrob/fastq-pair
Step 1: Clone or download - copy URL
Step 2: In Cedar or ARC or whatever cluster type: git clone -should see a directory called “fastq-pair”
Step 3: gcc fastq-pair/*.c -o fastq_pair
Step 4: Should see an exectuable script called “fastq_pair”
Running it is super simple.
path/to/fastq_pair -t VALUE path/to/file_R1.fastq path/to/file_R2.fastq
Where VALUE is roughly the number of sequences in your file. This is the setting of the hash table size. For some reason I don’t know why, I couldn’t get this to run as a SLURM. It would make the outfiles (if I gave it over 50Gb) but it would run for far longer than it runs in the terminal and just never finish. I ended up running it in ARC’s console because Cedar doesn’t have enough resources allocated to their console.
It makes four output files. R1 paired and unpaired and R2 paired and unpaired. Then you will need to zip the files back up
Align reads - BWA (Burrows-Wheeler Aligner)
Aligning sequences to the reference genome for Physella acuta (GenBank RDRX00000000) (Schultz et al. 2020)