Dec 09, 2025

Genome-representative sampling as a framework for diversity studies: a case study on geminiviruses

  • Moshood Olamide Lateef1,2,
  • Dauda Nathaniel3,
  • Bolaji Osundahunsi4,
  • Neil Arvin Bretana1
  • 1IU International University of Applied Sciences, Germany;
  • 2International Institute of Tropical Agriculture, Nigeria;
  • 3Department of Crop Sciences, University of Nigeria Nsukka;
  • 4Department of Entomology and Plant Pathology, University of Arkansas System Division of Agriculture, United States of America
Protocol Citation: Moshood Olamide Lateef, Dauda Nathaniel, Bolaji Osundahunsi, Neil Arvin Bretana 2025. Genome-representative sampling as a framework for diversity studies: a case study on geminiviruses. protocols.io https://dx.doi.org/10.17504/protocols.io.j8nlkybjdg5r/v1
License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Protocol status: Working
We use this protocol and it's working
Created: July 03, 2025
Last Modified: December 09, 2025
Protocol Integer ID: 221649
Keywords: Geminiviruses, Diversity, Geographical distribution, Plant viruses, Evolution
Abstract
The rapid increase in publicly available genomic sequences, together with the need for constant updates, poses challenges for representative sampling in diversity studies. This protocol describes a systematic method for representative sampling of geminiviruses, a highly diverse group of plant-infecting viruses. By applying a specific identity threshold to each genus, the protocol selects a broad, representative set of genomes suitable for studying diversity and phylogeny-based evolutionary relationships. It also includes procedures for constructing a geographical distribution map from genome metadata.
Materials
NCBI (GenBank): A comprehensive public database for genomic sequences. https://www.ncbi.nlm.nih.gov/
Sequence Demarcation Tool (SDT) version 1.3: A tool that classifies sequences based on their percentage pairwise identity. http://web.cbio.uct.ac.za/~brejnev/
NCBI datasets tool: A tool designed to access and retrieve genomic data. https://www.ncbi.nlm.nih.gov/datasets/docs/v2/
Python: An open-source programming language that offers extensive libraries, such as Pandas for data manipulation and Matplotlib for data visualization. https://www.python.org/
Plotly: An open-source library for the creation of interactive charts and dashboards. https://plotly.com/python/
Jupyter Notebook: A tool that enables writing and execution of code. https://jupyter.org/
Laptop and Internet access: Needed to access the databases and to run the entire computational analysis.
Part 1: Defining the Sampling Criteria
Species demarcation in the Geminiviridae family is based on genome-wide pairwise identity thresholds ranging between 78% and 91% (Brown et al., 2015). This protocol uses genus-specific thresholds above the species demarcation cutoff for all genera except Begomovirus to ensure adequate representative sampling. For Begomovirus, due to its vast diversity, a previously used 80% threshold (Bandoo et al., 2024) was adopted. The grouping threshold adopted for genome selection in each genus was based on the level of species divergence and the number of available sequences belonging to the genus. Specifically, genome sequences were grouped using a pairwise identity of ≥ 95% for Topocuvirus, Topilevirus, Maldovirus, and Eragrovirus; ≥ 92% for Capulavirus, Citlodavirus, Curtovirus, Becurtovirus, Grablovirus, Mulcrilevirus, Opunvirus, Turncurtovirus, and Welwivirus; and ≥ 80% for Mastrevirus and Begomovirus.
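The genus-specific grouping thresholds listed above can be kept in a small lookup table so that later steps read them from one place. The sketch below is illustrative only: the names GROUPING_THRESHOLDS and grouping_threshold are hypothetical and are not taken from the authors' scripts.

```python
# Minimum pairwise identity (%) used to group genomes, per genus,
# transcribed from the thresholds stated in Part 1.
# GROUPING_THRESHOLDS and grouping_threshold are illustrative names.
GROUPING_THRESHOLDS = {
    **dict.fromkeys(
        ["Topocuvirus", "Topilevirus", "Maldovirus", "Eragrovirus"], 95),
    **dict.fromkeys(
        ["Capulavirus", "Citlodavirus", "Curtovirus", "Becurtovirus",
         "Grablovirus", "Mulcrilevirus", "Opunvirus", "Turncurtovirus",
         "Welwivirus"], 92),
    **dict.fromkeys(["Mastrevirus", "Begomovirus"], 80),
}


def grouping_threshold(genus: str) -> int:
    """Return the minimum pairwise identity (%) for grouping a genus."""
    try:
        return GROUPING_THRESHOLDS[genus.strip().capitalize()]
    except KeyError:
        raise ValueError(f"No grouping threshold defined for {genus!r}") from None


print(grouping_threshold("opunvirus"))    # 92
print(grouping_threshold("Begomovirus"))  # 80
```

The upper bound entered in SDT is 100% in every case, so only the lower bound needs to be stored.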
All scripts and processed metadata are available at Geminivirus-diversity
Part 2: Representative Genome Sampling Protocol
Download Sequences:
Retrieve complete nucleotide sequences from GenBank, genus by genus.
For large genera (e.g., Begomovirus, Mastrevirus), retrieve sequences taxon by taxon through the 'Results by taxon' section of the GenBank graphical user interface (GUI).
Pairwise Identity Calculation using SDT:
Import the downloaded sequences into SDT v1.3.
Run pairwise identity analysis using the default alignment algorithm.
Once the pairwise identity matrix is generated, click Save → Create Datasets [based on p-identities].
Apply Identity Thresholds:
Input the genus-specific maximum and minimum identity thresholds (e.g., 100% and 92% for Opunvirus).
SDT will cluster the sequences into groups.
Representative Selection:
From each identity group generated by SDT (step 4.2), select one or a few sequences and pool them to represent the genus (Opunvirus in this example).
Repeat this process for each genus using the described identity thresholds.
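The grouping and selection logic can be illustrated outside SDT. The sketch below is not SDT's own algorithm, which this protocol does not document. It assumes single-linkage grouping (two sequences share a group when a chain of pairs within the identity window connects them) and uses toy sequence names and identity values. The function names group_by_identity and pick_representatives are hypothetical.

```python
from itertools import combinations


def group_by_identity(names, identity, min_identity, max_identity=100.0):
    """Single-linkage grouping of sequences by pairwise identity (%).

    `identity` maps (name_a, name_b) tuples to percentage identity.
    This is an assumed stand-in for SDT's dataset creation step.
    """
    parent = {n: n for n in names}

    def find(x):
        # Follow parent links to the group root (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(names, 2):
        pid = identity.get((a, b), identity.get((b, a)))
        if pid is not None and min_identity <= pid <= max_identity:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    # Largest groups first; ties broken by member names.
    return sorted(groups.values(), key=lambda g: (-len(g), g))


def pick_representatives(groups, per_group=1):
    """Take the first `per_group` members of every group."""
    return [name for group in groups for name in group[:per_group]]


# Toy data: hypothetical names and identities, not real results.
names = ["seq1", "seq2", "seq3", "seq4"]
identity = {
    ("seq1", "seq2"): 96.1, ("seq1", "seq3"): 85.0, ("seq2", "seq3"): 84.2,
    ("seq1", "seq4"): 70.0, ("seq2", "seq4"): 71.3, ("seq3", "seq4"): 93.5,
}
groups = group_by_identity(names, identity, min_identity=92.0)
print(groups)                        # [['seq1', 'seq2'], ['seq3', 'seq4']]
print(pick_representatives(groups))  # ['seq1', 'seq3']
```

Single linkage can chain distantly related sequences through intermediates, so groups from this sketch may be larger than those SDT reports. Taking the first member of each group is also arbitrary, whereas the protocol leaves the choice to the user.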
Handling Large Genera (Begomovirus & Mastrevirus):
Partition each taxon (species within the genus) that has a large number of sequences into smaller groups.
Process each group separately using SDT as described in step 3.
Use 100% and 80% identity thresholds to generate groups. This also enables the separation of DNA-A and DNA-B for bipartite genomes.
Select sequences from each group to form the representative set for the taxon.
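Partitioning a large taxon before loading it into SDT can be scripted. The sketch below parses FASTA text and splits the records into fixed-size batches. The batch size is an assumption, since the protocol does not state one, and read_fasta and partition are hypothetical helper names.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def partition(records, batch_size):
    """Split records into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [records[i:i + batch_size]
            for i in range(0, len(records), batch_size)]


# Toy input: five placeholder records, two sequence lines each.
demo = "".join(f">rec{i}\nACGT\nTTGA\n" for i in range(1, 6))
records = read_fasta(demo)
batches = partition(records, batch_size=2)  # batch size chosen arbitrarily
print([len(b) for b in batches])            # [2, 2, 1]
```

Each batch can then be written to its own FASTA file and processed in SDT separately, as described in step 3.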
Final Compilation:
Merge all taxon representative sequences for the particular genus.
Process these merged sequences using SDT as described in step 3.
Use 100% and 80% identity thresholds to generate groups.
Select sequences from each group to form the representative set for the genus.
Part 3: Geographical Distribution Mapping
Metadata Retrieval and Processing
Download metadata using the NCBI datasets tool with the following command:
datasets summary virus genome taxon opunvirus --as-json-lines | dataformat tsv virus-genome > opunvirus.tsv
Replace opunvirus with other genus names to retrieve their respective metadata. The exception is Welwivirus, for which metadata were retrieved manually.
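Repeating the metadata command for every genus is easy to script. The sketch below only builds and prints the command strings, following the pattern given in this step. The GENERA list is transcribed from Part 1 with Welwivirus left out because its metadata was retrieved manually, and metadata_command is a hypothetical helper name.

```python
import shlex

# Genera named in Part 1, excluding Welwivirus (retrieved manually).
GENERA = [
    "Topocuvirus", "Topilevirus", "Maldovirus", "Eragrovirus",
    "Capulavirus", "Citlodavirus", "Curtovirus", "Becurtovirus",
    "Grablovirus", "Mulcrilevirus", "Opunvirus", "Turncurtovirus",
    "Mastrevirus", "Begomovirus",
]


def metadata_command(genus):
    """Build the shell pipeline from this step for a single genus."""
    taxon = genus.lower()
    return (
        f"datasets summary virus genome taxon {shlex.quote(taxon)} "
        f"--as-json-lines | dataformat tsv virus-genome "
        f"> {shlex.quote(taxon + '.tsv')}"
    )


for genus in GENERA:
    print(metadata_command(genus))
```

To execute rather than print, each string could be passed to subprocess.run(command, shell=True, check=True), provided the datasets and dataformat executables are installed and on the PATH.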
Geographical Data Processing:
Remove duplicates in the 'Geographic Location' column using the duplicate-removal script and save the output as a CSV file.
Manually retain one entry per country in the output CSV file of the duplicate-removal script (step 9.1) for each genus.
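The deduplication can be sketched with the Python standard library alone. This is not the authors' duplicate-removal script. It assumes the metadata file is tab-separated with a 'Geographic Location' column, and that some values may carry a 'Country: region' suffix, which is an assumption about the data. The function name unique_countries and the toy rows are hypothetical.

```python
import csv
import io


def unique_countries(tsv_text, column="Geographic Location"):
    """Return countries in order of first appearance, one entry each."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    seen, countries = set(), []
    for row in reader:
        value = (row.get(column) or "").strip()
        if not value:
            continue  # skip records without a location
        # Assumed format: keep only the part before any ':' separator.
        country = value.split(":", 1)[0].strip()
        key = country.casefold()
        if key not in seen:
            seen.add(key)
            countries.append(country)
    return countries


# Toy rows with placeholder accessions; not real metadata.
demo_tsv = (
    "Accession\tGeographic Location\n"
    "ACC0001\tMexico\n"
    "ACC0002\tMexico: Oaxaca\n"
    "ACC0003\tBrazil\n"
    "ACC0004\t\n"
    "ACC0005\tbrazil\n"
)
print(unique_countries(demo_tsv))  # ['Mexico', 'Brazil']
```

A manual check of the output is still worthwhile, because spelling variants of a country name are not merged by this sketch.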
Generate Unified Dataset:
Compile a single CSV file containing the countries and genera.
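One way to assemble the unified file is to flatten a genus-to-countries mapping into two columns. The sketch below assumes the column headers 'Country' and 'Genus', which the protocol does not specify, and uses placeholder country names. The function name compile_dataset is hypothetical.

```python
import csv
import io


def compile_dataset(countries_by_genus):
    """Return CSV text with one (Country, Genus) row per unique pair."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Country", "Genus"])  # assumed column headers
    for genus in sorted(countries_by_genus):
        for country in sorted(set(countries_by_genus[genus])):
            writer.writerow([country, genus])
    return buffer.getvalue()


# Placeholder countries; real values come from the per-genus CSV files.
demo = {
    "Opunvirus": ["Country B", "Country A", "Country A"],
    "Becurtovirus": ["Country C"],
}
print(compile_dataset(demo), end="")
```

Latitude and longitude columns can be added to this file once coordinates have been assigned in the geolocation step.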
Geolocation Assignment:
Use online geolocation tools such as LatLong.net to assign coordinates. When multiple genera are reported from the same country, assign each genus the coordinates of a different province or state within that country.
Avoid automated geocoding at the country level, so that each genus reported from a country keeps its own unique coordinates (state or province).
Save and Format Data:
Save the dataset as CSV UTF-8 (Comma delimited) for compatibility.
Map Generation:
Use the global geminivirus distribution script to generate a geographical distribution map of geminivirus diversity.
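The mapping step can be approximated with Plotly's Scattergeo trace, one trace per genus. The sketch below is not the authors' global geminivirus distribution script. The column names Country, Genus, Latitude, and Longitude are assumptions, as are the function names and the toy coordinates. Plotly is imported inside build_map so that the data preparation runs without it.

```python
import csv
import io
from collections import defaultdict


def traces_by_genus(csv_text):
    """Group latitude, longitude, and country labels by genus."""
    traces = defaultdict(lambda: {"lat": [], "lon": [], "text": []})
    for row in csv.DictReader(io.StringIO(csv_text)):
        points = traces[row["Genus"]]
        points["lat"].append(float(row["Latitude"]))
        points["lon"].append(float(row["Longitude"]))
        points["text"].append(row["Country"])
    return dict(traces)


def build_map(traces):
    """Build a Plotly figure with one marker trace per genus."""
    import plotly.graph_objects as go  # requires Plotly to be installed

    fig = go.Figure()
    for genus, points in sorted(traces.items()):
        fig.add_trace(go.Scattergeo(
            lat=points["lat"], lon=points["lon"], text=points["text"],
            mode="markers", name=genus,
        ))
    fig.update_geos(projection_type="natural earth", showcountries=True)
    fig.update_layout(title_text="Geminivirus genera by location")
    return fig


# Toy rows with placeholder countries and coordinates.
demo_csv = (
    "Country,Genus,Latitude,Longitude\n"
    "Country A,Opunvirus,10.0,20.0\n"
    "Country B,Opunvirus,12.5,-30.0\n"
    "Country A,Becurtovirus,11.0,21.0\n"
)
traces = traces_by_genus(demo_csv)
print(sorted(traces))  # ['Becurtovirus', 'Opunvirus']
```

With Plotly installed, build_map(traces).write_html("map.html") saves an interactive map. When reading a file saved as CSV UTF-8 from a spreadsheet program, opening it with encoding="utf-8-sig" strips the byte order mark such files usually start with.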
Protocol references
  1. Brown JK, Zerbini FM, Navas-Castillo J, Moriones E, Ramos-Sobrinho R, Silva JC, Fiallo-Olivé E, Briddon RW, Hernández-Zepeda C, Idris A, Malathi VG. Revision of Begomovirus taxonomy based on pairwise sequence comparisons. Archives of Virology. 2015;160:1593–1619. https://doi.org/10.1007/s00705-015-2398-y
  2. Bandoo RA, Kraberger S, Varsani A. Two Novel Geminiviruses Identified in Bees (Apis mellifera and Nomia sp.). Viruses. 2024;16(4):602. https://doi.org/10.3390/v16040602