Mar 10, 2026

GermVarX: An Automated Workflow for Joint Germline Variant Exploration in Whole-Exome Sequencing Cohorts V.1

  • Nguyen Thi Phuong Thao¹,
  • Nguyen Duc Dung¹,
  • Mai Van Thuy²,
  • Nguyen Khoi Dung³,
  • Nguyen Dang Tung⁴,
  • Truong Thi Minh Ngoc¹,
  • Ha Hong Hanh⁵,
  • Tran Thi Ha Trang⁶
  • ¹ Institute of Information Technology, Vietnam Academy of Science and Technology, Hanoi, Vietnam;
  • ² Hanoi University of Public Health, Hanoi, Vietnam;
  • ³ Electric Power University, Hanoi, Vietnam;
  • ⁴ Post and Telecommunications Institute of Technology, Hanoi, Vietnam;
  • ⁵ Institute of Biology, Vietnam Academy of Science and Technology, Hanoi, Vietnam;
  • ⁶ VinUni Bigdata Research Institute, VinUniversity, Hanoi, Vietnam
  • Nguyen Thi Phuong Thao: Corresponding author
Protocol Citation: Nguyen Thi Phuong Thao, Nguyen Duc Dung, Mai Van Thuy, Nguyen Khoi Dung, Nguyen Dang Tung, Truong Thi Minh Ngoc, Ha Hong Hanh, Tran Thi Ha Trang 2026. GermVarX: An Automated Workflow for Joint Germline Variant Exploration in Whole-Exome Sequencing Cohorts. protocols.io https://dx.doi.org/10.17504/protocols.io.3byl48kr8vo5/v1
Version created by Nguyen Thi Phuong Thao
Manuscript citation:
PONE-D-25-50266
License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Protocol status: Working
We use this protocol and it's working
Created: January 20, 2026
Last Modified: March 10, 2026
Protocol Integer ID: 238948
Keywords: Bioinformatics Pipeline, Automated Workflow, Whole-Exome Sequencing, Germline Variants, Joint Variant Calling, Cohort Analysis, source workflow for joint germline variant discovery, automated workflow for joint germline variant exploration, exome sequencing cohorts germvarx, joint germline variant discovery, joint germline variant exploration, key feature of germvarx, germvarx, enabling simultaneous genotyping, simultaneous genotyping of multiple sample, gatk haplotypecaller, joint genotyping, variant effect predictor, implementation of joint variant calling, cohort, joint variant calling, task execution across diverse computing environment, downstream analysis, variant caller, wes cohort study, unified reporting, exploration in wes cohort study, automated workflow
Funders Acknowledgements:
Vietnam Ministry of Science and Technology
Grant ID: KC4.0-37/19-25
Vietnam Academy of Science and Technology
Grant ID: CSCL02.06/24-25
Abstract
GermVarX is an open-source workflow for joint germline variant discovery and exploration in WES cohort studies. A key feature of GermVarX is its implementation of joint variant calling, enabling simultaneous genotyping of multiple samples to produce a single, high-confidence multi-sample VCF, optimized for downstream analyses. Implemented in Nextflow DSL2 with Docker, it supports fully automated execution, a modular architecture, and parallelized task execution across diverse computing environments, including workstations, HPC clusters, and cloud platforms. The workflow integrates two state-of-the-art variant callers—GATK HaplotypeCaller and DeepVariant—with joint genotyping performed via GATK or GLnexus. To increase reliability, GermVarX supports consensus generation between callers, coupled with sample- and cohort-level quality control, functional annotation using the Variant Effect Predictor (VEP), and unified reporting through MultiQC. In addition, it provides PLINK-compatible outputs, facilitating seamless integration with statistical and association analyses.
Prepare the Computational Environment
GermVarX is distributed as a Nextflow pipeline with Docker container support.

To set up the environment:

Install Docker

Follow the installation instructions for your platform:


Install Nextflow

GermVarX requires Nextflow (version ≥ 24).
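
The version requirement can be checked before running anything else. The following is a minimal sketch, not part of GermVarX: the helper names `nf_major` and `nf_version_ok` are hypothetical, and the parsing assumes that `nextflow -version` prints a banner containing a line of the form `version X.Y.Z build N`.

```shell
# Extract the major version from a dotted version string, e.g. "24.10.0" -> "24".
nf_major() {
  printf '%s\n' "${1%%.*}"
}

# Succeed if the version string has a numeric major version >= the minimum (default 24).
nf_version_ok() (
  major="$(nf_major "$1")"
  min_major="${2:-24}"
  case "$major" in
    ''|*[!0-9]*) return 1 ;;   # empty or non-numeric input is treated as "not OK"
  esac
  [ "$major" -ge "$min_major" ]
)

# Query the installed Nextflow only if it is on PATH.
if command -v nextflow >/dev/null 2>&1; then
  # Assumed banner format: "version 24.10.0 build 5928"
  installed="$(nextflow -version 2>/dev/null | sed -n 's/.*version \([0-9][0-9.]*\).*/\1/p' | head -n 1)"
  if nf_version_ok "$installed"; then
    echo "Nextflow $installed satisfies the version >= 24 requirement"
  else
    echo "Nextflow '$installed' does not satisfy the version >= 24 requirement" >&2
  fi
else
  echo "nextflow not found on PATH" >&2
fi
```

Only the major version is compared, which is sufficient for the "≥ 24" requirement stated above.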


Download the GermVarX Pipeline
Clone the source code from the official GitHub repository:


Set Up Docker Images
Pull the required pre-built images and build the GermVarX custom image:

# PLINK 1.9
docker pull quay.io/biocontainers/plink:1.90b6.21--h516909a_0

# GATK 4.2.6.1
docker pull broadinstitute/gatk:4.2.6.1

# DeepVariant 1.6.1
docker pull google/deepvariant:1.6.1

# VEP 114.1
docker pull ensemblorg/ensembl-vep:release_114.1

# GLnexus 1.4.1
docker pull quay.io/biocontainers/glnexus:1.4.1--h17e8430_5

# GermVarX pipeline (custom image)
docker build -t germvarx-pipeline:0.1 ./docker/germvarx-pipeline
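
After pulling and building, it can be useful to confirm that every image is available locally. The sketch below is an illustration rather than a GermVarX utility: the variable `GERMVARX_IMAGES` and the function `missing_images` are hypothetical names, and the tags are copied from the commands above.

```shell
# Image tags copied from the pull/build commands above.
GERMVARX_IMAGES="
quay.io/biocontainers/plink:1.90b6.21--h516909a_0
broadinstitute/gatk:4.2.6.1
google/deepvariant:1.6.1
ensemblorg/ensembl-vep:release_114.1
quay.io/biocontainers/glnexus:1.4.1--h17e8430_5
germvarx-pipeline:0.1
"

# Print every image from the list that `docker image inspect` cannot find locally.
missing_images() {
  for image in $GERMVARX_IMAGES; do
    docker image inspect "$image" >/dev/null 2>&1 || printf '%s\n' "$image"
  done
}

# Run the check only when the docker CLI is installed.
if command -v docker >/dev/null 2>&1; then
  missing="$(missing_images)"
  if [ -z "$missing" ]; then
    echo "All GermVarX images are available locally"
  else
    # Unquoted on purpose: one "Missing image" line is printed per tag.
    printf 'Missing image: %s\n' $missing >&2
  fi
else
  echo "docker not found on PATH" >&2
fi
```

If a different tool version is pulled, the corresponding tag in the list (and in docker.config) would need to be updated to match.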

Prepare Testing Data and Resources
Reference resources

Create a directory (e.g., named ReferenceDir) and download the required reference files as listed below. Some databases (e.g., dbSNP, dbNSFP, gnomAD, CADD) have multiple releases; you may choose an alternative version depending on your analysis preference and compatibility requirements. Be sure to download the corresponding index file as well.
Reference Indexing

Index the reference genome for downstream analysis:

cd ReferenceDir

# FASTA index
samtools faidx Homo_sapiens_assembly38.fasta

# Sequence dictionary required by GATK
gatk CreateSequenceDictionary \
-R Homo_sapiens_assembly38.fasta \
-O Homo_sapiens_assembly38.dict

# BWA-mem2 index
bwa-mem2 index Homo_sapiens_assembly38.fasta
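
Missing index files are a common cause of failures later in the run, so a quick check after indexing may help. The sketch below uses a hypothetical helper, `check_reference_indexes`. The `.fai` and `.dict` names follow from the commands above; the five bwa-mem2 suffixes (`.0123`, `.amb`, `.ann`, `.bwt.2bit.64`, `.pac`) are an assumption based on recent bwa-mem2 2.x releases and may differ for other versions.

```shell
# Report index files that are missing next to a reference FASTA.
# Argument: path to the FASTA file (assumed to end in ".fasta").
# Exit status is 0 when all expected files exist, 1 otherwise.
check_reference_indexes() (
  fasta="$1"
  missing=0
  for f in \
    "$fasta.fai" \
    "${fasta%.fasta}.dict" \
    "$fasta.0123" "$fasta.amb" "$fasta.ann" "$fasta.bwt.2bit.64" "$fasta.pac"
  do
    if [ ! -e "$f" ]; then
      echo "missing: $f"
      missing=1
    fi
  done
  return "$missing"
)

# Check the reference used in this protocol, if it is present.
ref="ReferenceDir/Homo_sapiens_assembly38.fasta"
if [ -f "$ref" ]; then
  check_reference_indexes "$ref" && echo "reference is fully indexed"
fi
```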

Configure Nextflow Parameters
The GermVarX source code contains the following structure:

src/
    main.nf
    modules/
    pipeline/
    scripts/
nextflow.config
docker/
    germvarx-pipeline/
configuration/
    params.config
    docker.config

src/

Contains the modularized workflow processes, organized according to the Nextflow DSL2 module structure. Processes are grouped into sub-workflows based on input type.

nextflow.config

Defines available custom profiles for execution.

docker/

Contains the Dockerfile for the GermVarX custom image.

configuration/

Contains configuration files:

  • docker.config

Defines the Docker containers used by the pipeline.
  1. Block 1: global container configuration
  2. Block 2: process-specific container mapping

Note
When mounting directories using --volume, ensure the paths inside the container match local paths for simpler parameter configuration.
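
To illustrate the note above, the sketch below builds a `--volume` argument that mounts a host directory at the identical path inside the container, so that a path written in params.config is valid both on the host and in the container. The helper `same_path_volume` and the directory `/data/ReferenceDir` are hypothetical. How the resulting string is passed to Docker (for example through Nextflow's `docker.runOptions` setting) depends on the contents of docker.config and is an assumption here.

```shell
# Build a --volume argument whose container path equals the host path.
same_path_volume() {
  printf '%s %s:%s\n' "--volume" "$1" "$1"
}

# Hypothetical reference directory; replace with your own location.
ref_dir="/data/ReferenceDir"
same_path_volume "$ref_dir"
# prints: --volume /data/ReferenceDir:/data/ReferenceDir

# The same idea with a plain docker command (not executed here):
#   docker run --rm --volume /data/ReferenceDir:/data/ReferenceDir \
#     broadinstitute/gatk:4.2.6.1 ls /data/ReferenceDir
```

With a mapping such as `/data/ReferenceDir:/ref`, every reference path in params.config would instead have to be rewritten to its container form, which is easy to get wrong.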

  • params.config

Defines pipeline parameters (see Table 2).
Parameter                        Description
params.py_ABfilter               Path to custom AB filter script.
params.outputDir                 Output directory for all pipeline results.
params.single_sample_mode        Run per-sample mode (true) or joint-calling mode (false).
params.hard_filter               Apply GATK hard filters instead of VQSR.
params.use_genomicsdb            Use GenomicsDB for joint calling (instead of CombineGVCFs).
params.inputDir                  Directory containing FASTQ files.
params.allFastq                  Pattern to capture all FASTQ files.
params.reads                     Paired-end FASTQ pattern (_1 / _2 suffix).
params.inputBAM                  Input BAM file (if starting from aligned BAM).
params.inputGVCF_gatk            Input GATK-generated GVCFs.
params.inputGVCF_dv              Input DeepVariant-generated GVCFs.
params.inputVCF_gatk             Input GATK VCF file.
params.inputVCF_dv               Input DeepVariant VCF file.
params.output_type               If set, the pipeline stops early and outputs only the BAM or GVCF stage.
params.readGroupLibrary          Read group library name (e.g., WES).
params.readGroupPlatform         Sequencing platform (e.g., ILLUMINA).
params.readGroupUnit             Read group unit (lane or flowcell ID).
params.exomeRegionsBed           Target capture BED file for WES analysis.

Reference genome information
params.refDir                    Directory containing reference genome resources.
params.alignmentRef              Reference FASTA for alignment and variant calling.
params.millsRef                  Indel resource for BQSR and variant filtering.
params.dbSNPRef                  dbSNP reference VCF.
params.hapmapRef                 HapMap reference VCF for VQSR.
params.omniRef                   Omni reference VCF for VQSR.
params.Ref1kG                    1000 Genomes SNP reference.
params.knownIndels               Known indel sites for BQSR.

Variant annotation resources
params.vepCacheDir               VEP cache directory.
params.pluginsDir                Directory containing annotation plugin resources.
params.vepPluginsDir             Subdirectory for VEP plugins.
params.dbNSFP                    dbNSFP annotation database.
params.gnomAD                    gnomAD exome frequency database.
params.clinvar                   ClinVar clinical variant database.
params.caddIndel                 CADD indel annotation resource.
params.caddsnvs                  CADD SNV annotation resource.

Hard-filter thresholds (if hard_filter = true)
params.indelQUAL                 Hard filter: minimum QUAL for indels.
params.indelINFO_QD              Hard filter: Quality by Depth (QD) for indels.
params.indelINFO_ReadPosRankSum  Hard filter: Read position bias for indels.
params.indelINFO_FS              Hard filter: Fisher strand bias for indels.
params.snpQUAL                   Hard filter: minimum QUAL for SNPs.
params.snpINFO_QD                Hard filter: Quality by Depth (QD) for SNPs.
params.snpINFO_MQ                Hard filter: Mapping Quality for SNPs.
params.snpINFO_ReadPosRankSum    Hard filter: Read position bias for SNPs.
params.snpINFO_MQRankSum         Hard filter: Mapping Quality Rank Sum for SNPs.
params.snpINFO_FS                Hard filter: Fisher strand bias for SNPs.
params.snpINFO_SOR               Hard filter: Strand Odds Ratio for SNPs.

Cohort-level variant quality control thresholds
params.QUAL                      Minimum QUAL, i.e., the confidence that a variant is truly polymorphic across the cohort.
params.DP                        Genotypes with sequencing depth below this threshold are set to missing.
params.GQ                        Genotypes with genotype quality below this threshold are set to missing.
params.ABlower                   Heterozygous calls with allele balance below this threshold are set to missing.
params.ABupper                   Heterozygous calls with allele balance above this threshold are set to missing.
params.VarCallRate               Variant call rate across the cohort.

Parallelism (forks)
params.heavyFork                 Number of parallel jobs for heavy tasks (alignment, variant calling).
params.lightFork                 Number of parallel jobs for lightweight tasks (QC, annotation).

Table 2: GermVarX – Pipeline Parameters
Note
Important: Input paths should not be defined directly in params.config. Provide them as command-line arguments instead, as described in the Run the Pipeline section below.
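
As a worked example of the ABlower and ABupper thresholds in Table 2, the sketch below computes allele balance for a heterozygous call and decides whether the genotype would be kept or set to missing. It is an illustration only: it assumes the common definition AB = alt reads / (ref reads + alt reads), the helper names are hypothetical, and the thresholds 0.2 and 0.8 are illustrative values rather than the pipeline defaults. The GermVarX script referenced by params.py_ABfilter may define or apply the filter differently.

```shell
# Allele balance of a heterozygous call, printed with two decimals.
# Arguments: ref_depth alt_depth
allele_balance() {
  awk -v ref="$1" -v alt="$2" 'BEGIN {
    if (ref + alt == 0) { print "NA"; exit }
    printf "%.2f\n", alt / (ref + alt)
  }'
}

# Decide whether a heterozygous genotype is kept or set to missing.
# Arguments: ref_depth alt_depth ab_lower ab_upper
ab_decision() {
  awk -v ref="$1" -v alt="$2" -v lo="$3" -v hi="$4" 'BEGIN {
    if (ref + alt == 0) { print "missing"; exit }
    ab = alt / (ref + alt)
    if (ab < lo || ab > hi) print "missing"
    else print "keep"
  }'
}

# Illustrative thresholds only (assumed, not the pipeline defaults).
allele_balance 27 3          # prints: 0.10
ab_decision 27 3 0.2 0.8     # prints: missing (AB is below the lower bound)
ab_decision 14 16 0.2 0.8    # prints: keep (AB is about 0.53)
```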

Run the Pipeline
After parameter configuration, run the pipeline from the GermVarX directory (where nextflow.config is located):

nextflow run src/main.nf -profile docker [OPTIONS]

To run from another directory:

nextflow run /path/to/project/src/main.nf \
-c /path/to/project/nextflow.config \
-profile docker [OPTIONS]

INPUT Options:

  • FASTQ input
nextflow run src/main.nf -profile docker --inputDir <path/to/folder_fastq_files>

  • BAM input
nextflow run src/main.nf -profile docker --inputBAM <path/to/folder_BAM_files>

  • GVCF input (GATK)
nextflow run src/main.nf -profile docker --inputGVCF_gatk <path/to/folder_GATK_GVCF_files>

  • GVCF input (DeepVariant)
nextflow run src/main.nf -profile docker --inputGVCF_dv <path/to/folder_DeepVariant_GVCF_files>

  • VCF input (GATK)
nextflow run src/main.nf -profile docker --inputVCF_gatk <path/to/folder_GATK_VCF_files>

  • VCF input (DeepVariant)
nextflow run src/main.nf -profile docker --inputVCF_dv <path/to/folder_DeepVariant_VCF_files>

  • FASTQ input and GVCF output
nextflow run src/main.nf -profile docker --inputDir <path/to/folder_fastq_files> --output_type GVCF
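
For FASTQ input, Table 2 describes the paired-end pattern as a _1 / _2 suffix, so it may be worth confirming that every _1 file has a mate before launching a run. The sketch below is a hypothetical pre-flight check, not part of GermVarX. The `.fastq.gz` extension is an assumption; pass a different extension as the second argument if your files are named otherwise.

```shell
# List sample names whose "_1" FASTQ file has no matching "_2" file.
# Arguments: directory [extension, default ".fastq.gz" (assumed)]
unpaired_fastq() (
  dir="$1"
  ext="${2:-.fastq.gz}"
  for r1 in "$dir"/*_1"$ext"; do
    [ -e "$r1" ] || continue            # the glob matched nothing
    stem="${r1%"_1$ext"}"               # strip the "_1<ext>" suffix
    [ -e "${stem}_2$ext" ] || basename "$stem"
  done
  return 0
)

# Example (not executed here):
#   unpaired_fastq <path/to/folder_fastq_files>
```

An empty result means that every _1 file found in the directory has a corresponding _2 file.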

Optional Parameters:
Parameter                        Description                                                             Default
-work-dir <workspace_directory>  Path to intermediate files                                              work
--outputDir <path/to/outputDir>  Path to final results                                                   output
--output_type {BAM | GVCF}       Stop pipeline early at BAM or GVCF stage                                null (full pipeline)
--use_genomicsdb                 Use GenomicsDBImport instead of CombineGVCFs                            FALSE
--hard_filter                    Apply hard filters instead of VQSR (recommended for small sample sets)  FALSE
--single_sample_mode             Run in single-sample mode (no joint genotyping)                         FALSE

Illustrative test case
To demonstrate the execution of GermVarX, we provide an example of running the full pipeline on FASTQ data located in the testdata/fastq directory (see Step 4). From the GermVarX root directory (where nextflow.config is located), the full pipeline can be executed with the following command:

nextflow run src/main.nf -profile docker \
--inputDir <path/to/testdata/fastq> \
--outputDir out_testfullpipe

Before execution, users should ensure that all parameters are properly configured, reference paths are correctly specified, and the BED file provided in testdata/bed is included. Upon successful completion, the execution report will be displayed on the terminal (see Figure 1).

Figure 1: Screenshot of the GermVarX execution report and completion summary displayed on the terminal.

The output directory out_testfullpipe will contain the final results, organized into multiple subfolders (see Figure 2) and associated files corresponding to each stage of the pipeline.

Figure 2: Directory structure of out_testfullpipe.

Example with Hard Filtering

To run the workflow with hard filtering enabled (instead of the default VQSR), use the following command:

nextflow run src/main.nf -profile docker \
--inputDir <path/to/testdata/fastq> \
--outputDir out_hardfilter --hard_filter

In this case, the terminal output will reflect the hard filtering process, as illustrated in Figure 3.

Figure 3: Screenshot of the GermVarX execution report and completion summary displayed on the terminal with hard filtering enabled.
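
To make the hard-filter thresholds in Table 2 more concrete, the sketch below shows how such thresholds are commonly expressed as GATK VariantFiltration arguments (`--filter-expression` and `--filter-name`). It is an illustration under assumptions: the helper names are hypothetical, the numeric values are GATK's generic published starting points for SNPs rather than GermVarX defaults, and whether the pipeline invokes VariantFiltration in exactly this way is not confirmed by this protocol.

```shell
# Emit one pair of VariantFiltration arguments for a single annotation.
# Arguments: annotation, comparison (lt = fail if below, gt = fail if above), threshold
filter_flag() {
  case "$2" in
    lt) op="<" ;;
    gt) op=">" ;;
    *)  echo "unknown comparison: $2" >&2; return 1 ;;
  esac
  printf '%s "%s %s %s" %s "%s_%s"\n' \
    "--filter-expression" "$1" "$op" "$3" "--filter-name" "$1" "$2"
}

# SNP filters using GATK's generic starting values (assumed, not pipeline defaults).
snp_hard_filter_flags() {
  filter_flag QUAL lt 30.0
  filter_flag QD lt 2.0
  filter_flag MQ lt 40.0
  filter_flag ReadPosRankSum lt -8.0
  filter_flag MQRankSum lt -12.5
  filter_flag FS gt 60.0
  filter_flag SOR gt 3.0
}

snp_hard_filter_flags
```

Each printed line corresponds to one of the params.snp* entries in Table 2; a variant that matches an expression is flagged with the given filter name rather than removed from the VCF.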