Dec 26, 2025

Machine learning approach to multi-locus Y-chromosome STR sequence profiling using Covary

  • Marvin De los Santos, ChordexBio
Protocol Citation: Marvin De los Santos 2025. Machine learning approach to multi-locus Y-chromosome STR sequence profiling using Covary. protocols.io https://dx.doi.org/10.17504/protocols.io.ewov1ky2pgr2/v1
License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Protocol status: Working
We use this protocol and it's working
Created: December 25, 2025
Last Modified: December 26, 2025
Protocol Integer ID: 235821
Keywords: chromosome str sequence profiling, chromosomal str loci from the ncbi strseq bioproject, chromosomal str loci, str sequence similarity, scalable str sequence comparison without manual alignment, forensic genomic, traditional forensic str analysis, scalable str sequence comparison, applications in forensic genomic, str database exploration, compositional features across multiple str loci, multiple str loci, free analysis of short tandem repeat, raw sequence data, ncbi strseq bioproject, locus interpretation, str, locus, short tandem repeat, covary
Disclaimer
Your use of different Covary versions, Covary-encoder, TIPs-VF, and other components of the Covary suite may be limited. Please review the license notice at https://github.com/mahvin92/Covary?tab=License-1-ov-file. If you run Covary in a Colab notebook, please ensure that you comply with Google’s Terms of Service. Note that this protocol was tested on Covary v2.1 using the computational resources of a free-tier subscription in Google Colab. Covary is provided "as is", without warranty of any kind, express or implied. For more information, please visit https://covary.chordexbio.com or read our paper at https://doi.org/10.1101/2025.11.13.687960.
Abstract
This protocol describes the use of Covary for rapid, alignment-free analysis of short tandem repeat (STR) sequence variation using Y-chromosomal STR loci from the NCBI STRSeq BioProject (PRJNA380347). The workflow demonstrates the ability of Covary to (i) compare STR sequence similarity directly from raw sequence data, (ii) perform multi-locus STR analysis in a single run without locus-wise concatenation, and (iii) resolve locus-specific and inter-locus relationships using machine learning-derived vector representations.

Traditional forensic STR analysis relies on length-based allele designation and locus-by-locus interpretation. In contrast, this protocol illustrates a sequence-level, machine learning approach that captures internal repeat structure, flanking variation, and compositional features across multiple STR loci simultaneously. Covary enables scalable STR sequence comparison without manual alignment, custom scripting, or local software installation.

This protocol is optimized for execution in Google Colab and may be adapted for applications in forensic genomics, population genetics, and STR database exploration.
Protocol Introduction
This protocol documents the application of Covary for the following STR-focused analytic workflows:
  1. Sequence-level STR similarity analysis
  2. Multi-locus STR comparison in a single analytical run
  3. Relationship resolution within and between STR loci

The scope of this protocol is limited to Y-chromosomal STR sequence data obtained from the NCBI STRSeq BioProject (PRJNA380347). The analysis focuses on sequence similarity and clustering behavior rather than forensic allele calling or haplotype frequency estimation. Unlike marker-based phylogenetic protocols (e.g., 16S/18S rRNA) or whole-genome phylogenomics, this protocol demonstrates Covary’s performance on short, highly repetitive, and locus-specific sequences, which present distinct analytical challenges.

This protocol is designed to be performed using Covary v2.1 on Google Colab (Figure 1).


Figure 1. Covary v2.1 interface on Google Colab.

Overview of Covary for STR Analysis
Covary consists of two core computational components relevant to STR analysis:
  1. Covary-encoder: A proprietary genetic encoding logic that transforms nucleotide sequences into numerical vector representations. The encoder is built using Translator-Interpreter Pre-seeding for Variable-length Fragment (TIPs-VF), which enables length-aware, perturbation-sensitive, and position-aware representation of STR sequences without requiring alignment.
  2. Neural Network: A deep learning model adapted from Keras that performs similarity learning across encoded STR vectors, allowing comparative analysis across loci and samples.

For STR data, Covary operates without prior assumptions about repeat motif size, allele length, or locus identity, enabling:
  • Direct comparison of STR sequences with different repeat counts
  • Multi-locus analysis without concatenation or progressive-stratification (coalescence)
  • Visualization of sequence-level relationships across loci
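
As a rough illustration of what "alignment-free" means here, the sketch below maps sequences of different lengths to fixed-length numeric vectors by counting k-mers. This is not the Covary-encoder or TIPs-VF, which are proprietary and not described in this protocol. It is a generic stand-in, the function name `kmer_vector` is hypothetical, and the toy alleles are invented. Unlike the encoder described above, plain k-mer counts carry no positional information.

```python
from itertools import product


def kmer_vector(seq, k=2):
    """Count every DNA k-mer in seq; output order is alphabetical (AA, AC, ...)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # windows with N or other symbols are skipped
            counts[window] += 1
    return [counts[km] for km in kmers]


# Toy alleles (invented): the same motif repeated 3 and 5 times.
short_allele = "TCTA" * 3
long_allele = "TCTA" * 5

short_vec = kmer_vector(short_allele)
long_vec = kmer_vector(long_allele)
print(len(short_vec), len(long_vec))  # both vectors have 4**k entries
```

Both alleles yield vectors of the same dimension even though their lengths differ, so they can be compared directly without an alignment step.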

The Covary workflow follows the same operational steps described in phylogenetic and phylogenomic protocols, including parameter configuration, data upload, QC, encoding, inference, scoring, and result export.
Data Acquisition and Preparation
Obtain Y-STR sequences from the NCBI STRSeq BioProject.

Visit the STRSeq NCBI portal at https://www.ncbi.nlm.nih.gov/bioproject/380127
Navigate to the 'Project' section and select the BioProject accession PRJNA380347.
Scroll to the 'Project Data' section and click the link count in the 'Nucleotide (Genomic DNA)' table row. At the time of writing, the database contains 562 links (sequence entries).
Download the sequences for the loci listed in Table 1 and compile them into a single multi-FASTA file (the .fasta file type is recommended).

BioProject accession   Organism       Title
PRJNA396118            Homo sapiens   STRSeq DYF387S1 Sequence-Based Alleles (National Institute of Standards...)
PRJNA396119            Homo sapiens   STRSeq DYS19 Sequence-Based Alleles (National Institute of Standards...)
PRJNA396120            Homo sapiens   STRSeq DYS385 a/b Sequence-Based Alleles (National Institute of Standards...)
PRJNA396122            Homo sapiens   STRSeq DYS389 I/II Sequence-Based Alleles (National Institute of Standards...)
PRJNA396123            Homo sapiens   STRSeq DYS390 Sequence-Based Alleles (National Institute of Standards...)
PRJNA396124            Homo sapiens   STRSeq DYS391 Sequence-Based Alleles (National Institute of Standards...)
PRJNA396125            Homo sapiens   STRSeq DYS392 Sequence-Based Alleles (National Institute of Standards...)
PRJNA396126            Homo sapiens   STRSeq DYS393 Sequence-Based Alleles (National Institute of Standards...)
PRJNA396127            Homo sapiens   STRSeq DYS437 Sequence-Based Alleles (National Institute of Standards...)
PRJNA396128            Homo sapiens   STRSeq DYS438 Sequence-Based Alleles (National Institute of Standards...)
PRJNA396129            Homo sapiens   STRSeq DYS439 Sequence-Based Alleles (National Institute of Standards...)
PRJNA396130            Homo sapiens   STRSeq DYS448 Sequence-Based Alleles (National Institute of Standards...)
PRJNA396131            Homo sapiens   STRSeq DYS456 Sequence-Based Alleles (National Institute of Standards...)
PRJNA396132            Homo sapiens   STRSeq DYS458 Sequence-Based Alleles (National Institute of Standards...)
PRJNA396134            Homo sapiens   STRSeq DYS461 and DYS460 Sequence-Based Alleles (National Institute of Standards...)
PRJNA396135            Homo sapiens   STRSeq DYS481 Sequence-Based Alleles (National Institute of Standards...)
PRJNA396136            Homo sapiens   STRSeq DYS505 Sequence-Based Alleles (National Institute of Standards...)
PRJNA396137            Homo sapiens   STRSeq DYS522 Sequence-Based Alleles (National Institute of Standards...)
PRJNA396138            Homo sapiens   STRSeq DYS533 Sequence-Based Alleles (National Institute of Standards...)
PRJNA396139            Homo sapiens   STRSeq DYS549 Sequence-Based Alleles (National Institute of Standards...)
PRJNA396140            Homo sapiens   STRSeq DYS570 Sequence-Based Alleles (National Institute of Standards...)
PRJNA396141            Homo sapiens   STRSeq DYS576 Sequence-Based Alleles (National Institute of Standards...)
PRJNA396142            Homo sapiens   STRSeq DYS612 Sequence-Based Alleles (National Institute of Standards...)
PRJNA396143            Homo sapiens   STRSeq DYS635 Sequence-Based Alleles (National Institute of Standards...)
PRJNA396144            Homo sapiens   STRSeq DYS643 Sequence-Based Alleles (National Institute of Standards...)
PRJNA396145            Homo sapiens   STRSeq Y-GATA-H4 Sequence-Based Alleles (National Institute of Standards...)
Table 1. Y-STR sequence loci involved in the analysis, as presented in https://www.ncbi.nlm.nih.gov/bioproject/380347.
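
If the sequences are downloaded as one FASTA file per locus, the compilation step could be scripted along the following lines. This is a minimal sketch using only the Python standard library. The function names, file names, and toy records are hypothetical, and it assumes that the first whitespace-delimited token of each FASTA header is the accession.

```python
import tempfile
from pathlib import Path


def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def compile_multifasta(input_paths, output_path):
    """Merge FASTA files into one multi-FASTA, keeping each accession once."""
    seen, kept = set(), []
    for path in input_paths:
        for header, seq in read_fasta(Path(path).read_text()):
            accession = header.split()[0]  # assumption: accession comes first
            if accession in seen:
                continue
            seen.add(accession)
            kept.append((header, seq))
    with open(output_path, "w") as out:
        for header, seq in kept:
            out.write(f">{header}\n{seq}\n")
    return len(kept)


# Toy demonstration with invented records (not STRSeq data).
with tempfile.TemporaryDirectory() as tmp:
    a = Path(tmp, "locus_a.fasta")
    a.write_text(">TOY0001.1 toy locus A\nTCTATCTA\nTCTA\n")
    b = Path(tmp, "locus_b.fasta")
    b.write_text(">TOY0002.1 toy locus B\nGATAGATA\n>TOY0001.1 duplicate\nTCTA\n")
    merged = Path(tmp, "ystr_compiled.fasta")
    n_records = compile_multifasta([a, b], merged)
    merged_records = read_fasta(merged.read_text())

print(n_records)
```

On the real dataset, the returned record count can be compared with the number of entries shown on the NCBI page (562 at the time of writing) as a quick completeness check before upload.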

Alternatively, you can download the dataset below:

STR Profiling using Covary
Feed the compiled STR-seq data to Covary by uploading it in Step 2, or follow the previously described protocol (Machine learning-based phylogenetic analysis using Covary).
Once the upload is complete, wait until Covary finishes learning from the data.
After model training, Covary scores and analyzes your input. The results are downloaded at the end of the run (Step 8. Download results).
Training Validation
Validate the training results and parameters in Covary by inspecting the epoch count, the total number of batches, the training duration, the loss, and the mean absolute error (MAE) reported in Step 6 (Deep learning).
Inspect the clustering patterns and pairwise distances in the embedding plots. A well-defined clustering pattern (either superimposition or close grouping) in the dimensionality reduction plots, together with distinct blocks in the heatmaps, normally indicates successful comparative sequence representation.
Select the model that best suits your research objective.
For the purpose of this exercise, t-SNE is used to visualize the vector embeddings of the Y-STR-seq profiles. The t-SNE distance matrix is used to analyze pairwise relationships among the sequences, and complete-linkage hierarchical clustering is used to reconstruct the dendrogram.
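
To make the complete-linkage step concrete, the sketch below clusters a handful of 2-D points in the same way a dendrogram would be built from t-SNE coordinates. It is an illustration only: Covary produces its own dendrogram, the coordinates and sample names are invented, and the function `complete_linkage` is a hypothetical pure-Python helper rather than part of Covary.

```python
from math import dist


def complete_linkage(points, names):
    """Agglomerative clustering; returns the merge history as (left, right, height)."""
    clusters = {i: [i] for i in range(len(points))}
    labels = {i: names[i] for i in range(len(points))}
    merges = []
    next_id = len(points)
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                # Complete linkage: cluster distance is the farthest member pair.
                d = max(dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((labels[a], labels[b], d))
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        labels[next_id] = f"({labels[a]},{labels[b]})"
        next_id += 1
    return merges


# Invented 2-D coordinates standing in for t-SNE output.
toy_points = [(0.0, 0.0), (0.5, 0.0), (5.0, 0.0), (6.0, 0.0)]
toy_names = ["s1", "s2", "s3", "s4"]

merges = complete_linkage(toy_points, toy_names)
for left, right, height in merges:
    print(left, right, height)
```

Each tuple in the merge history corresponds to one internal node of the dendrogram, with the third value giving the height at which the two branches join.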
Interpretation of Results
Inspect the quality and clustering patterns generated in t-SNE embeddings and pairwise heatmap plots, as shown in Figure 2 and Figure 3 below.


Figure 2. Representative plot of the Y-STR-seq vector embeddings, visualized using t-SNE. The color gradient represents the arrangement of the sequence entries in the .fasta file (index).


Figure 3. Representative heatmap plot of the Y-STR-seq vector distance metrics, analyzed using the Euclidean method. The color gradient represents the distance metrics, and the sequences are arranged on the plot in the same order as in the .fasta file.
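
The "distinct blocks" seen in a heatmap can also be summarized numerically by comparing Euclidean distances within a locus with those between loci. The following is a minimal sketch under stated assumptions: the coordinates and locus labels are invented, and `block_contrast` is a hypothetical helper, not a Covary output.

```python
from math import dist
from statistics import mean


def block_contrast(points, labels):
    """Return (mean within-group distance, mean between-group distance)."""
    within, between = [], []
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = dist(points[i], points[j])  # Euclidean distance
            if labels[i] == labels[j]:
                within.append(d)
            else:
                between.append(d)
    return mean(within), mean(between)


# Invented embedding coordinates for two toy loci, A and B.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
toy_labels = ["A", "A", "B", "B"]

within_mean, between_mean = block_contrast(toy_points, toy_labels)
print(within_mean, between_mean)
```

A within-locus mean that is much smaller than the between-locus mean corresponds to the block pattern along the heatmap diagonal.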

Evaluate the tree topologies, clusters, clade formation, and branch length, as shown in Figure 4.

Figure 4. Hierarchical clustering of the different Y STR-seq loci used in this analysis.

The results of this exercise show that Covary resolved multi-locus clustering consistent with the expected Y-STR-seq groupings listed in Table 1. The results also captured within-locus relationships among repeat variants. For example, in the DYS19 marker, MK990411.2 and MT607263.1 are closer to each other than to MT607264.1 and MK990412.2. This is expected, since the first two harbor 15 DYS19 repeats, while the third contains 17 repeats and the fourth contains 16 repeats, as follows:

  • MK990411.2: DYS19 15 TA[2]TCTA[12]CCTA[1]TCTA[3]AAACAC[1]TA[6]ACAC[1]TA[5]ATAC[1]TA[5]
  • MT607263.1: DYS19 15 TA[2]TCTA[13]CCTA[1]TCTA[2]AAACAC[1]TA[6]ACAC[1]TA[5]ATAC[1]TA[5]
  • MT607264.1: DYS19 17 TA[2]TCTA[14]CCTA[1]TCTA[3]AAACAC[1]TA[6]ACAC[1]TA[5]ATAC[1]TA[5]
  • MK990412.2: DYS19 16 TA[2]TCTA[13]CCTA[1]TCTA[3]AAACAC[1]TA[6]ACAC[1]TA[5]ATAC[1]TA[5]
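
The bracketed repeat descriptions above can be parsed programmatically to check the stated repeat counts. The sketch below is illustrative, with hypothetical helper names. It uses only the four strings listed above, and expanding a description reconstructs only the bracketed region, not necessarily the full database record.

```python
import re


def parse_bracket_notation(notation):
    """Split 'TA[2]TCTA[12]...' into a list of (motif, count) pairs."""
    return [(motif, int(count))
            for motif, count in re.findall(r"([ACGT]+)\[(\d+)\]", notation)]


def motif_total(notation, motif):
    """Sum the counts of every block that uses the given motif."""
    return sum(c for m, c in parse_bracket_notation(notation) if m == motif)


def expand(notation):
    """Expand the bracketed description into a plain nucleotide string."""
    return "".join(m * c for m, c in parse_bracket_notation(notation))


dys19 = {
    "MK990411.2": "TA[2]TCTA[12]CCTA[1]TCTA[3]AAACAC[1]TA[6]ACAC[1]TA[5]ATAC[1]TA[5]",
    "MT607263.1": "TA[2]TCTA[13]CCTA[1]TCTA[2]AAACAC[1]TA[6]ACAC[1]TA[5]ATAC[1]TA[5]",
    "MT607264.1": "TA[2]TCTA[14]CCTA[1]TCTA[3]AAACAC[1]TA[6]ACAC[1]TA[5]ATAC[1]TA[5]",
    "MK990412.2": "TA[2]TCTA[13]CCTA[1]TCTA[3]AAACAC[1]TA[6]ACAC[1]TA[5]ATAC[1]TA[5]",
}

tcta_totals = {acc: motif_total(n, "TCTA") for acc, n in dys19.items()}
print(tcta_totals)

# The two 15-repeat entries have equal length but different internal structure.
same_length = len(expand(dys19["MK990411.2"])) == len(expand(dys19["MT607263.1"]))
same_sequence = expand(dys19["MK990411.2"]) == expand(dys19["MT607263.1"])
print(same_length, same_sequence)
```

For these four entries the TCTA block counts sum to the listed repeat numbers (15, 15, 17, 16). The two 15-repeat entries differ only in how the TCTA repeats are split around the CCTA interruption, a difference that length-based allele designation cannot distinguish.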


Visualized topology for the selected DYS19 marker.
                     ┌────────── MT607264.1 (DYS19 17)
             ┌───────┤
             │       └────────── MK990412.2 (DYS19 16)
─────────────┤
             │       ┌── MT607263.1 (DYS19 15)
             └───────┤
                     └── MK990411.2 (DYS19 15)

Application Note
This protocol demonstrates that Covary can be applied to:
  • Alignment-free STR sequence analysis
  • STR database exploration and validation
  • Comparative STR genomics beyond allele length metrics

In population-level STR-seq comparisons, correlating the relationships among these markers with the STR profile of an individual or a group of individuals may reveal kinship (paternal lineage for Y-STRs, or other complex kinship relationships when X-STRs are included) and may support human identification when autosomal STR loci are analyzed. Overall, this protocol describes the potential use of Covary in multi-locus Y-STR profiling and, potentially, in other STR-seq-based approaches in forensic studies.