Oct 25, 2022

Public workspacePopulation Structure and Phylogenetics

  • 1The Earlham Institute
Icon indicating open access to content
QR code linking to this content
Protocol CitationGraham Etherington 2022. Population Structure and Phylogenetics. protocols.io https://dx.doi.org/10.17504/protocols.io.bretm3en
Manuscript citation:
Etherington GJ, Ciezarek A, Shaw R, Michaux J, Croose E, Haerty W, Palma FD, Extensive genome introgression between domestic ferret and European polecat during population recovery in Great Britain. Journal of Heredity 113(5). doi: 10.1093/jhered/esac038
License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License,  which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited
Protocol status: Working
Created: January 13, 2021
Last Modified: October 25, 2022
Protocol Integer ID: 46259
Abstract
The European polecat (Mustela putorius) is a mammalian predator which breeds across much of Europe east to central Asia. In Great Britain, following years of persecution the European polecat has recently undergone a population increase due to legal protection and its range now overlaps that of feral domestic ferrets (Mustela putorius furo). During this range expansion, European polecats hybridised with feral domestic ferrets producing viable offspring. Here we carry out population-level whole genome sequencing on domestic ferrets, British European polecats, and European polecats from the European mainland and find high degrees of genome introgression in British polecats outside their previous stronghold, even in those individuals phenotyped as ‘pure’ polecats.
Structure analysis
Structure analysis
RAxML phylogeny


Run RAxML in two stages. The first is the 'check' and 'parse' options that check the alignment and creates an *raxml.rba binary alignment file. It also estimates the resources needed to run RAxML efficiently (cores and memory)

ALIGNMENT=$1 #An alignment in fasta format

source RAxML-NG-0.9.0
raxml-ng-mpi --check --msa $ALIGNMENT --model GTR+G --prefix T1
raxml-ng-mpi --parse --msa $ALIGNMENT --model GTR+G --prefix T2


Using the resources advised by the previous step, run final step to produce the phylogeny.

source RAxML-NG-0.9.0
raxml-ng-mpi --all --msa T2.raxml.rba --outgroup weasel  --prefix raxml_all --threads 1 --seed 2 --bs-metric fbp,tbe

Treemix phylogeny
The first step is to create a 'clust' file. The clust file contains three columns, where the first and second column both indicate the name of the individual and the third column indicates the taxon name. This can be achieved by using a combination of bcftools and awk.

source bcftools-1.9 bcftools query -l$FILE.vcf.gz | awk'{split($1,pop,"."); print $1"\t"$1"\t"pop[2]}'> mustelids.clust

Next, we need as input a VCF file that has been pruned for linkage- disequilibrium (LD).
The input for this is the 'all_samples_genotyped_snps.vcf' file produced here:

We filtered for an LD of 0.8 over windows of 5Kb

source bcftools-1.9
bcftools +prune -l 0.8 -w 5000 all_samples_genotyped_snps.vcf -Oz -o all_samples_genotyped_snps_ld_0.8.vcf.gz

Remove any missing data

source vcftools-0.1.13
vcftools --gzvcf all_samples_genotyped_snps_ld_0.8.vcf.gz --max-missing 1 --recode --stdout | gzip > all_samples_no_missing.vcf.gz

Convert the VCF file into TreeMix format
The files required can be found here:


source plink-1.90
source vcftools-0.1.13
./vcf2treemix.sh all_samples_no_missing.vcf.gz mustelids.clust

Then run TreeMix with 1 to 5 migration edges (-m parameter)

source treemix-1.13 for i in {1..5} do  treemix -i all_samples_no_missing.vcf.gz -m $i -o all_samples_$i -root weasel -bootstrap -k 500 > treemix_${i}_log & done


To create ADMIXTURE plots for our data, we used the following pipeline:
BCFTOOLS prune for linkage, reformat the into Plink format, run Structure over a sensible range of K, chooseK and plot with Structure


Take the linkage pruned dataset from 'Phylogenetics, Step 3.2, and handle missing genotypes

source plink-1.90
plink --bfile dom_ep_snps_plink_stucture --geno 0.999 --make-bed --out dom_ep_snps_plink_stucture_pruned

Then run on k=2-5

K=$1
source admixture-1.3.0 admixture --cv -j6 dom_ep_snps_plink_stucture_pruned.bed $K
To identify the best value of K, I used the --cv for bootstrap and CV estimates
grep out the CV errors to find the best (highest) value of K

grep "CV" *.out | awk '{print $3,$4}' | cut -c 2-4,6,7-20 > ADMIXTURE.cv.error

Plot all of the Admixture plots together.
Rscript plotADMIXTURE.R -p dom_ep_snps_plink_stucture_pruned -i k5_sample.map -k 5 -l euro,welsh,english,hybrid,domestic