Sep 21, 2020
  • 1Centro de Investigación y de Estudios Avanzados del IPN (Cinvestav);
  • 2Instituto Nacional de Medicina Genómica
  • Whole genome variation in 27 Mexican indigenous populations, demographic and biomedical insights
Icon indicating open access to content
QR code linking to this content
Protocol CitationJudith Ballesteros Villascan, Israel Aguilar Ordoñez, Fernando Pérez-Villatoro 2020. VCF2PCP. protocols.io https://dx.doi.org/10.17504/protocols.io.bkwbkxan
Manuscript citation:
Aguilar-Ordoñez I, Pérez-Villatoro F, García-Ortiz H, Barajas-Olmos F, Ballesteros-Villascán J, González-Buenfil R, Fresno C, Garcíarrubio A, Fernández-López JC, Tovar H, Hernández-Lemus E, Orozco L, Soberón X, Morett E (2021) Whole genome variation in 27 Mexican indigenous populations, demographic and biomedical insights. PLoS ONE 16(4): e0249773. doi: 10.1371/journal.pone.0249773
License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License,  which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited
Protocol status: Working
We use this protocol and it's working
Created: September 05, 2020
Last Modified: September 21, 2020
Protocol Integer ID: 41635
Abstract
Nextflow pipeline that runs and plots admixture and smartpca from a compressed VCF.
Guidelines

Installation

Download VCF2PCP from Github repository:
git clone https://github.com/jbv2/VCF2PCP.git


Compatible OS*:

* VCF2PCP may run in other UNIX based OS and versions, but testing is required.

Software Requirements:
Software
bcftools
NAME

Software
plink
NAME

Software
Eigensoft
NAME

Software
Admixture
NAME

Software
Nextflow
NAME

Software
Plan9
NAME

Software
R
NAME

Materials

Pipeline Inputs

  • A compressed VCF file with extension '.vcf.gz'.
Example line(s):
##fileformat=VCFv4.2 #CHROM POS ID REF ALT QUAL FILTER INFO chr21 5101724 . G A . PASS AC=1;AF=0.00641;AN=152;DP=903;ANN=A|intron_variant|MODIFIER|GATD3B|ENSG00000280071|Transcript|ENST00000624810.3|protein_coding||4/5|ENST00000624810.3:c.357+19987C>T|||||||||-1|cds_start_NF&cds_end_NF|SNV|HGNC|HGNC:53816||5|||ENSP00000485439||A0A096LP73|UPI0004F23660|||||||chr21:g.5101724G>A||||||||||||||||||||||||||||2.079|0.034663|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| chr21 5102165 rs1373489291 G T . PASS AC=1;AF=0.00641;AN=140;DP=853;ANN=T|intron_variant|MODIFIER|GATD3B|ENSG00000280071|Transcript|ENST00000624810.3|protein_coding||4/5|ENST00000624810.3:c.357+19546C>A|||||||rs1373489291||-1|cds_start_NF&cds_end_NF|SNV|HGNC|HGNC:53816||5|||ENSP00000485439||A0A096LP73|UPI0004F23660|||||||chr21:g.5102165G>T||||||||||||||||||||||||||||5.009|0.275409||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

  • A file that contains the name of samples and the group that belongs to, separated by " ".(samples.txt)
Example line(s):
sample1 Zoque
sample2 PEL
sample3 PEL
sample4 CHB
...

  • A file that contains fields: sample, pop, and region separated by tabs. (tag_data.tsv). It helps for regions like north, central, and south.
Example line(s):
sample pop region
sample2 PEL PEL
sample3 PEL PEL
sample4 CHB CHB
...

Before start

Test

To test VCF2PCP execution using test data, run:
./runtest.sh

Your console should print the Nextflow log for the run, once every process has been submitted, the following message will appear:
======
vcf2pcp: Basic pipeline TEST SUCCESSFUL
======

VCF2PCP results for test data should be in the following file:
VCF2PCP/test/results/VCF2PCP-results

Usage

To run VCF2PCP go to the pipeline directory and execute:
nextflow run vcf2pcp.nf --vcffile <path to input 1> [--output_dir path to results ]

For information about options and parameters, run:
nextflow run vcf2pcp.nf --help

Before Nextflow
Before Nextflow
Format and select samples
Removes unused contigs in the header and keeps given samples.

Dependencies:
Software
bcftools
NAME

Pre-processing
Pre-processing
Split chromosomes
Split chromosomes from a compressed VCF file.

Dependencies:
Software
bcftools
NAME

Simplify and remove LD
SimplifyVCCF to keep only INFO/AF and GT and removes LD variants with bcftools +prune. Please, consider window for LD pruning is given in bp.
Note
a) Remove variants in LD.
b) Simplify VCF to keep only INFO/AF and GT

Dependencies:
Software
bcftools
NAME


Rejoin VCF
Concatenate multiple VCF of different chromosomes.

Dependencies:
Software
bcftools
NAME

VCF to PLINK
Convert VCF to plink and filters MAF.
Note
a) Convert VCF to PLINK file.
Filter MAF with PLINK.

Dependencies:
Software
plink
NAME

Make pedind
Make pedind file for running smartpca by using tagger.R
  • tagger.R is a tool that takes columns of fam file and the groups of samples and makes pedind file.

Dependencies:
  • tagger.R
Make pop info
Make popinfo file for plotting admixture results by using make_popinfo.R
  • make_popinfo.R is a tool that takes columns of fam file and the groups of samples and makes popinfo file.

Dependencies:
  • make_popinfo.R
Core-processing
Core-processing
Make par file for smartpca
Make par file to run smartpca, runs it and take best snps a-nd Tracy-Widom statistics from stdout.
Note
a) Write par file
b) Run smartpca
c) Get best snps

Dependencies:
Software
Eigensoft
NAME

Keep autosomes
Keep only autosomal chromosomes for running admixture, as it is said in its documentation.

Dependencies:
Software
plink
NAME

Run admixture
Run admixture with K 2:9 by default and gathers all logs.

Dependencies:
Software
Admixture
NAME

Pos-processing
Pos-processing
Parallel coordinate plot
Get number of snps for PCA, and the number of statistically significant PCs and plots it by using parallel_plotter.R
  • parallel_plotter.R is a tool for making parallel coordinates plots.
Note
a) Get the number of snps for PCA, and number of statisttically significnt PCs.
b) Reformat the evec file to replace spaces.
c) Run Rscript


Dependencies:
  • parallel_plotter.R

Final Output:
Expected result



Regional PCA
Plot PCA of PC1 vs all PCs and makes PCP by region by using plotter.R
  • plotter.R is a tool for making parallel coordinates plot by region.

Dependencies:
  • plotter.R

Final Output:
Expected result




Plot Admixture
Plot all admixture results by using admixture_plotter.R
  • admixture_plotter.R is a tool for plotting each admixture result.

Dependencies:
  • admixture_plotter.R

Final Output:
Expected result



Plot CVS
Plot CVS from admixture by using plotter.R
  • plotter.R is a tool for plotting each CV from admixture results.

Dependencies:
  • plotter.R

Final Output:
Expected result





Gather admixture plots
Plot all admixture results in one file by using plotter.R
  • plotter.R is a tool for plotting k 2:9 from admixture results.

Dependencies:
  • plotter.R

Final Output:
Expected result



Kmeans
Get k means from significant PCs using kmean.R
  • kmean.R is a tool for making groups (k) fro significant PCs.
Note
a) Get the number of snps for PCA, and the number of statistically significant PCs
b) Reformat the evec file to replace spaces
c) Run Rscript

Dependencies:
kmean.R

Final Output:
Expected result