Machine learning approach to multi-locus Y-chromosome STR sequence profiling using Covary

Marvin De los Santos

Dec 26, 2025

DOI

https://dx.doi.org/10.17504/protocols.io.ewov1ky2pgr2/v1

Marvin De los Santos¹

¹ChordexBio

Protocol Citation: Marvin De los Santos 2025. Machine learning approach to multi-locus Y-chromosome STR sequence profiling using Covary. protocols.io https://dx.doi.org/10.17504/protocols.io.ewov1ky2pgr2/v1

License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Protocol status: Working

We use this protocol and it's working

Created: December 25, 2025

Last Modified: December 26, 2025

Protocol Integer ID: 235821

Keywords: chromosome str sequence profiling, chromosomal str loci from the ncbi strseq bioproject, chromosomal str loci, str sequence similarity, scalable str sequence comparison without manual alignment, forensic genomic, traditional forensic str analysis, scalable str sequence comparison, applications in forensic genomic, str database exploration, compositional features across multiple str loci, multiple str loci, free analysis of short tandem repeat, raw sequence data, ncbi strseq bioproject, locus interpretation, str, locus, short tandem repeat, covary

Disclaimer

Your usage of different Covary versions, Covary-encoder, TIPs-VF and other components of the Covary suite may be limited. Please review the license notice at https://github.com/mahvin92/Covary?tab=License-1-ov-file. If you implement Covary using a Colab notebook, please ensure that you comply with Google’s Terms of Service. Note that this protocol was tested on Covary v2.1 using the computational resources of a free-tier subscription in Google Colab. Covary is provided "as is", without warranty of any kind, express or implied. For more information, please visit https://covary.chordexbio.com or read our paper at https://doi.org/10.1101/2025.11.13.687960.

Abstract

This protocol describes the use of Covary for rapid, alignment-free analysis of short tandem repeat (STR) sequence variation using Y-chromosomal STR loci from the NCBI STRSeq BioProject (PRJNA380347). The workflow demonstrates the ability of Covary to (i) compare STR sequence similarity directly from raw sequence data, (ii) perform multi-locus STR analysis in a single run without locus-wise concatenation, and (iii) resolve locus-specific and inter-locus relationships using machine learning-derived vector representations.

Traditional forensic STR analysis relies on length-based allele designation and locus-by-locus interpretation. In contrast, this protocol illustrates a sequence-level, machine learning approach that captures internal repeat structure, flanking variation, and compositional features across multiple STR loci simultaneously. Covary enables scalable STR sequence comparison without manual alignment, custom scripting, or local software installation.

This protocol is optimized for execution in Google Colab and may be adapted for applications in forensic genomics, population genetics, and STR database exploration.

Protocol Introduction

This protocol documents the application of Covary for the following STR-focused analytic workflows:
Sequence-level STR similarity analysis
Multi-locus STR comparison in a single analytical run
Relationship resolution within and between STR loci

The scope of this protocol is limited to Y-chromosomal STR sequence data obtained from the NCBI STRSeq BioProject (PRJNA380347). The analysis focuses on sequence similarity and clustering behavior rather than forensic allele calling or haplotype frequency estimation. Unlike marker-based phylogenetic protocols (e.g., 16S/18S rRNA) or whole-genome phylogenomics, this protocol demonstrates Covary’s performance on short, highly repetitive, and locus-specific sequences, which present distinct analytical challenges.

This protocol is designed to be performed using Covary v2.1 on Google Colab (Figure 1).


Figure 1. Covary v2.1 interface on Google Colab.

Overview of Covary for STR Analysis

Covary consists of two core computational components relevant to STR analysis:
Covary-encoder: A proprietary genetic encoding logic that transforms nucleotide sequences into numerical vector representations. The encoder is built using Translator-Interpreter Pre-seeding for Variable-length Fragment (TIPs-VF), enabling length-aware, sequence perturbation-sensitive and position-aware representation of STR sequences without requiring alignment.
Neural Network: A deep learning model adapted from Keras that performs similarity learning across encoded STR vectors, allowing comparative analysis across loci and samples.

For STR data, Covary operates without prior assumptions about repeat motif size, allele length, or locus identity, enabling:
Direct comparison of STR sequences with different repeat counts
Multi-locus analysis without concatenation or progressive-stratification (coalescence)
Visualization of sequence-level relationships across loci
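To make the idea of comparing sequences with different repeat counts without alignment more concrete, the sketch below uses normalized k-mer frequencies. This is explicitly not Covary-encoder or TIPs-VF, which the protocol describes as proprietary; k-mer counting is a generic, well-known alignment-free stand-in, and unlike the encoder described above it is not position-aware. The function names and the toy alleles are assumptions made for illustration only.

```python
from itertools import product
import math


def kmer_vector(seq, k=3):
    """Return normalized k-mer frequencies as a fixed-length list (4**k entries).

    Generic alignment-free representation; NOT the Covary encoder.
    """
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    counts = [0.0] * len(index)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])  # k-mers containing non-ACGT symbols are skipped
        if j is not None:
            counts[j] += 1.0
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Toy alleles (made up for illustration): same motif with different repeat
# counts, and a different motif altogether.
short_allele = "TCTA" * 10
long_allele = "TCTA" * 14
other_motif = "GATA" * 12

d_same_motif = euclidean(kmer_vector(short_allele), kmer_vector(long_allele))
d_other_motif = euclidean(kmer_vector(short_allele), kmer_vector(other_motif))
print(f"same motif, different length: {d_same_motif:.4f}")
print(f"different motif:              {d_other_motif:.4f}")
```

Because every sequence maps to a vector of the same size, alleles of different lengths can be compared directly, and the two TCTA alleles end up much closer to each other than to the GATA sequence.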

The Covary workflow follows the same operational steps described in phylogenetic and phylogenomic protocols, including parameter configuration, data upload, QC, encoding, inference, scoring, and result export.

Data Acquisition and Preparation

Obtain Y-STR sequences from the NCBI STRSeq BioProject.

Visit the STRSeq NCBI portal at https://www.ncbi.nlm.nih.gov/bioproject/380127

Navigate to the 'Project' section and select the BioProject accession PRJNA380347.

Scroll to the 'Project Data' section and click the link count in the 'Nucleotide (Genomic DNA)' table row. At the time of writing, the database contains 562 links (sequence entries).

Download the sequences and compile them into a single multi-FASTA file (the .fasta file type is recommended). The loci included are listed in Table 1.

BioProject accession	Organism	Title
PRJNA396118	Homo sapiens	STRSeq DYF387S1 Sequence-Based Alleles (National Institute of Standards...)
PRJNA396119	Homo sapiens	STRSeq DYS19 Sequence-Based Alleles (National Institute of Standards...)
PRJNA396120	Homo sapiens	STRSeq DYS385 a/b Sequence-Based Alleles (National Institute of Standards...)
PRJNA396122	Homo sapiens	STRSeq DYS389 I/II Sequence-Based Alleles (National Institute of Standards...)
PRJNA396123	Homo sapiens	STRSeq DYS390 Sequence-Based Alleles (National Institute of Standards...)
PRJNA396124	Homo sapiens	STRSeq DYS391 Sequence-Based Alleles (National Institute of Standards...)
PRJNA396125	Homo sapiens	STRSeq DYS392 Sequence-Based Alleles (National Institute of Standards...)
PRJNA396126	Homo sapiens	STRSeq DYS393 Sequence-Based Alleles (National Institute of Standards...)
PRJNA396127	Homo sapiens	STRSeq DYS437 Sequence-Based Alleles (National Institute of Standards...)
PRJNA396128	Homo sapiens	STRSeq DYS438 Sequence-Based Alleles (National Institute of Standards...)
PRJNA396129	Homo sapiens	STRSeq DYS439 Sequence-Based Alleles (National Institute of Standards...)
PRJNA396130	Homo sapiens	STRSeq DYS448 Sequence-Based Alleles (National Institute of Standards...)
PRJNA396131	Homo sapiens	STRSeq DYS456 Sequence-Based Alleles (National Institute of Standards...)
PRJNA396132	Homo sapiens	STRSeq DYS458 Sequence-Based Alleles (National Institute of Standards...)
PRJNA396134	Homo sapiens	STRSeq DYS461 and DYS460 Sequence-Based Alleles (National Institute of Standards...)
PRJNA396135	Homo sapiens	STRSeq DYS481 Sequence-Based Alleles (National Institute of Standards...)
PRJNA396136	Homo sapiens	STRSeq DYS505 Sequence-Based Alleles (National Institute of Standards...)
PRJNA396137	Homo sapiens	STRSeq DYS522 Sequence-Based Alleles (National Institute of Standards...)
PRJNA396138	Homo sapiens	STRSeq DYS533 Sequence-Based Alleles (National Institute of Standards...)
PRJNA396139	Homo sapiens	STRSeq DYS549 Sequence-Based Alleles (National Institute of Standards...)
PRJNA396140	Homo sapiens	STRSeq DYS570 Sequence-Based Alleles (National Institute of Standards...)
PRJNA396141	Homo sapiens	STRSeq DYS576 Sequence-Based Alleles (National Institute of Standards...)
PRJNA396142	Homo sapiens	STRSeq DYS612 Sequence-Based Alleles (National Institute of Standards...)
PRJNA396143	Homo sapiens	STRSeq DYS635 Sequence-Based Alleles (National Institute of Standards...)
PRJNA396144	Homo sapiens	STRSeq DYS643 Sequence-Based Alleles (National Institute of Standards...)
PRJNA396145	Homo sapiens	STRSeq Y-GATA-H4 Sequence-Based Alleles (National Institute of Standards...)
Table 1. Y-STR sequence loci involved in the analysis, as presented in https://www.ncbi.nlm.nih.gov/bioproject/380347.
 

Alternatively, you can download the precompiled dataset:
Dataset: Y-STR sequences
Link: https://github.com/mahvin92/Covary/blob/main/testing/sample%20data/Y_STRSeq_dataset.fasta
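Before uploading, it can help to sanity-check the compiled multi-FASTA file. The sketch below is one possible check, not part of Covary: it parses FASTA text, flags empty sequences, unexpected characters, and duplicate identifiers, and tallies records per locus. It assumes that each header contains the locus name (for example DYS19) somewhere in its text; that assumption, the function names, the demo headers, and the accession XX000001.1 are hypothetical and should be adapted to the headers actually downloaded.

```python
import re
from collections import Counter

# Assumption: locus names appear verbatim in the FASTA header line.
LOCUS_PATTERN = re.compile(r"\b(DYF\d+S\d+|DYS\d+|Y-GATA-H4)\b")


def read_fasta(text):
    """Parse multi-FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def check_records(records):
    """Return a list of problems found (an empty list means none)."""
    problems, seen = [], set()
    for header, seq in records:
        fields = header.split()
        accession = fields[0] if fields else "<missing id>"
        if not seq:
            problems.append(f"{accession}: empty sequence")
        elif set(seq) - set("ACGTN"):
            problems.append(f"{accession}: unexpected characters")
        if accession in seen:
            problems.append(f"{accession}: duplicate identifier")
        seen.add(accession)
    return problems


def locus_counts(records):
    """Count records per locus name found in the header."""
    counts = Counter()
    for header, _ in records:
        match = LOCUS_PATTERN.search(header)
        counts[match.group(1) if match else "unassigned"] += 1
    return counts


# Tiny in-memory demo with hypothetical headers and made-up sequences.
demo = """>MK990411.2 hypothetical header DYS19
TATATCTA
TCTA
>XX000001.1 hypothetical header DYS391
tctatcta
"""
records = read_fasta(demo)
print(len(records), "records;", dict(locus_counts(records)))
print("problems:", check_records(records))
```

For the real file, the text would be read from disk (for example with `open(path).read()`, where `path` is wherever the file was saved), and `len(records)` can be compared against the entry count shown on the BioProject page.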

STR Profiling using Covary

Upload the compiled STRSeq multi-FASTA file to Covary in Step 2 of the Covary workflow, or follow the previously described protocol (Machine learning-based phylogenetic analysis using Covary).

Once upload is complete, wait until Covary finishes learning from the data.

After model training, Covary will score and analyze your input. Results will be downloaded after the run (Step 8. Download results).

Training Validation

Validate the training results and parameters in Covary by inspecting the epoch count, the total number of batches, the training duration, the loss, and the mean absolute error (MAE) reported in Step 6 (Deep learning).

Inspect the clustering patterns and pairwise distances in the embedding plots. A well-defined clustering pattern (either superimposition or close grouping) in the different dimensionality reduction plots, together with distinct blocks in the heatmaps, normally indicates successful comparative sequence representation.

Adopt or select the model that best suits your research objective.

For the purpose of this exercise, t-SNE is used to visualize the vector embeddings of the Y STR-seq profiles. The t-SNE distance matrix is used to analyze the pairwise relationships of the sequences, and complete hierarchical linkage is used to reconstruct the dendrogram.
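Covary performs these steps internally. Purely to illustrate what complete linkage on a Euclidean distance matrix does, the sketch below clusters four toy 2-D points that stand in for t-SNE coordinates. The labels, coordinates, and function names are invented for this example and do not come from the Covary output.

```python
import math


def euclidean_matrix(points):
    """Pairwise Euclidean distances between coordinate tuples."""
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]


def complete_linkage(dist, labels):
    """Agglomerative clustering with complete linkage.

    Returns the merge history as (members_a, members_b, height) tuples,
    in the order the merges happen.
    """
    clusters = {i: [i] for i in range(len(labels))}
    merges, next_id = [], len(labels)
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for x in range(len(ids)):
            for y in range(x + 1, len(ids)):
                a, b = ids[x], ids[y]
                # Complete linkage: cluster distance is the LARGEST
                # pairwise distance between members of the two clusters.
                d = max(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((
            sorted(labels[i] for i in clusters[a]),
            sorted(labels[i] for i in clusters[b]),
            d,
        ))
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return merges


# Toy coordinates standing in for a 2-D t-SNE embedding (made up).
labels = ["locusA_1", "locusA_2", "locusB_1", "locusB_2"]
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (10.0, 2.0)]

merges = complete_linkage(euclidean_matrix(points), labels)
for left, right, height in merges:
    print(f"{left} + {right} at height {height:.3f}")
```

The two within-locus pairs merge first at small heights, and the two loci join only at the top of the tree. For real embeddings, `scipy.cluster.hierarchy.linkage` with `method="complete"` is the more usual choice than a hand-written loop.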

Interpretation of Results

Inspect the quality and clustering patterns generated in t-SNE embeddings and pairwise heatmap plots, as shown in Figure 2 and Figure 3 below.

Figure 2. Representative plot of the Y-STR-seq vector embeddings, visualized using t-SNE. The color gradient represents the arrangement of the sequence entries in the .fasta file (index).

Figure 3. Representative heatmap plot of the Y-STR-seq vector distance metrics, computed using Euclidean distance. The color gradient represents the distance metrics, and the arrangement of the sequence data on the plot matches the order of sequences in the .fasta file.

Evaluate the tree topologies, clusters, clade formation, and branch lengths, as shown in Figure 4.

Figure 4. Hierarchical clustering of the different Y STR-seq loci used in this analysis.

The results of this exercise show that Covary grouped the sequences by locus, consistent with the expected Y-STR-seq groupings listed in Table 1. The results also captured within-locus relationships among alleles. For example, at the DYS19 marker, MK990411.2 and MT607263.1 are closer to each other than to MT607264.1 or MK990412.2. This is expected because the first two carry DYS19 allele 15, whereas the third carries allele 17 and the fourth carries allele 16, as follows:

MK990411.2: DYS19 15 TA[2]TCTA[12]CCTA[1]TCTA[3]AAACAC[1]TA[6]ACAC[1]TA[5]ATAC[1]TA[5]
MT607263.1: DYS19 15 TA[2]TCTA[13]CCTA[1]TCTA[2]AAACAC[1]TA[6]ACAC[1]TA[5]ATAC[1]TA[5]
MT607264.1: DYS19 17 TA[2]TCTA[14]CCTA[1]TCTA[3]AAACAC[1]TA[6]ACAC[1]TA[5]ATAC[1]TA[5]
MK990412.2: DYS19 16 TA[2]TCTA[13]CCTA[1]TCTA[3]AAACAC[1]TA[6]ACAC[1]TA[5]ATAC[1]TA[5]


Visualized topology for the selected DYS19 marker.
                     ┌────────── MT607264.1 (DYS19 17)
             ┌───────┤
             │       └────────── MK990412.2 (DYS19 16)
─────────────┤
             │       ┌── MT607263.1 (DYS19 15)
             └───────┤
                     └── MK990411.2 (DYS19 15)
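The bracketed DYS19 strings listed above can be expanded back into nucleotide sequences to see why sequence-level comparison adds information beyond allele length. The sketch below assumes the notation is a series of `MOTIF[count]` tokens, as it appears in this protocol; the function names are illustrative and are not part of Covary.

```python
import re

# One token of the bracketed notation, e.g. "TCTA[12]".
TOKEN = re.compile(r"([ACGT]+)\[(\d+)\]")


def parse_bracketed(notation):
    """Split bracketed repeat notation into (motif, count) pairs."""
    return [(motif, int(count)) for motif, count in TOKEN.findall(notation)]


def expand(notation):
    """Expand bracketed repeat notation into the full nucleotide string."""
    return "".join(motif * count for motif, count in parse_bracketed(notation))


def mismatches(a, b):
    """Count differing positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


# Bracketed strings copied from the DYS19 example in this protocol.
alleles = {
    "MK990411.2": "TA[2]TCTA[12]CCTA[1]TCTA[3]AAACAC[1]TA[6]ACAC[1]TA[5]ATAC[1]TA[5]",
    "MT607263.1": "TA[2]TCTA[13]CCTA[1]TCTA[2]AAACAC[1]TA[6]ACAC[1]TA[5]ATAC[1]TA[5]",
    "MT607264.1": "TA[2]TCTA[14]CCTA[1]TCTA[3]AAACAC[1]TA[6]ACAC[1]TA[5]ATAC[1]TA[5]",
    "MK990412.2": "TA[2]TCTA[13]CCTA[1]TCTA[3]AAACAC[1]TA[6]ACAC[1]TA[5]ATAC[1]TA[5]",
}

sequences = {acc: expand(text) for acc, text in alleles.items()}
lengths = {acc: len(seq) for acc, seq in sequences.items()}
print(lengths)

# The two allele-15 entries have the same length but different sequences.
same_length_diff = mismatches(sequences["MK990411.2"], sequences["MT607263.1"])
print("positions differing between the two allele-15 sequences:", same_length_diff)
```

The two allele-15 entries expand to the same length (114 nt) yet differ at two positions because the CCTA interruption sits one repeat unit later in MT607263.1. A length-based designation treats them as identical, whereas a sequence-level comparison can distinguish them. The allele-16 and allele-17 entries are 4 nt and 8 nt longer, respectively.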

Application Note

This protocol demonstrates that Covary can be applied to:
Alignment-free STR sequence analysis
STR database exploration and validation
Comparative STR genomics beyond allele length metrics

In population-level STR-seq comparisons, relating these marker relationships to the STR profile of an individual or a group of individuals may reveal kinship (paternal lineage for Y-STRs, or more complex kinship relationships when X-STRs are included) and may support human identification when autosomal STR loci are analyzed. Overall, this protocol describes the potential use of Covary in multi-locus Y-STR profiling and, potentially, in other STR-seq-based approaches in forensic studies.
