Patient-derived MTSs and their corresponding original BRC tissues from patient (P136) were used for scRNA-seq analysis in this study. The single-cell transcriptome libraries were constructed using the 10x Genomics platform. The scRNA-seq libraries were prepared with the Chromium Next GEM Single Cell 3ʹ Library and Gel Beads Kit v3.1 (PN-1000121) according to the manufacturer’s instructions. Quantification of cDNA and DNA libraries for each sample was performed using Qubit 4 Fluorometers (Thermo Fisher Scientific), and quality control was conducted with the Agilent 2100 Bioanalyzer (Agilent Technologies). All single-cell transcriptomes were sequenced on the Ilumina Novaseq 6000 platform conducted by OE Biotech Co., Ltd. (Shanghai, China), generating paired-end reads with a length of 150 bp.
After sequencing, the scRNA-seq data underwent processing and clustering. The CellRanger version 6.0.1 (10x Genomics) was utilized for aligning reads to the human reference genome (GRCh38), and the raw count matrix for each sample was obtained from the CellRanger unique molecular identifier (UMI) matrix output. Genes expressed in at least 0.1% cells were retained for downstream analyses. Initially, the percentage of counts originating from mitochondrial RNA and heat shock-related RNA per cell was calculated first to ensure high-quality datasets. Subsequently, cells were filtered based on higher-quality characteristics, including mitochondrial reads below 10%, a number of detected genes ranging from 400 to 8,000, and a number of UMIs ranging from 500 to 50,000. Python package Scrublet90 was applied to estimate potential doublets in each sample with the expected doublet rate of 5%. Standard preprocessing steps involved normalizing feature expression measurements for each cell by total expression, followed by multiplication with a scale factor (1e4), and log transformation of the result. Subsequently, we scaled expression values regressing out the percent of mitochondrial counts, count numbers, detected genes and heat shock-related genes. We selected the top 5,000 most variable genes for downstream analyses. Principal component analysis (PCA) was performed as a linear dimensionality reduction on the scaled data with 50 principal components retained. To correct the batch effects arising from different samples, we applied BBKNN91 to generate a batch-balanced k nearest neighbor (KNN) graph with parameters neighbors_within_batch = 3.
Subsequently, we employed the Leiden algorithm to iteratively cluster cells together, initially setting the resolution at 0.2. We identified the major 8 cell clusters based on well-established cell markers. For visualization purposes, Manifold Approximation and Projection (UMAP) was applied to the KNN graph using a Python-based kernel. Furthermore, subpopulations within each major cell cluster were identified following the same aforementioned procedure starting from the unfiltered UMI matrix. To obtain different fine-grained clustering results, we varied the parameter resolution in the Louvain algorithm from 0.3 to 0.9 for each sub-clustering analysis. All clustering analyses were performed using Scanpy (version 1.8.1), a Python-based toolkit.
To annotate subpopulations within each major cell cluster, we initially generated the normalized SCT matrix from UMI count data using SCTransform, an R-based toolkit provided by Seurat (version 4.0.5). Subsequently, differentially expressed genes (DEGs) were identified on the SCT matrix using FindAllMarkers (MAST test with Bonferroni correction for multiple testing; adjusted P < 0.05). We exclusively considered genes that were detected in at least 10% of the cells within the cluster and exhibited an average fold difference of at least 0.25-fold (log-scale) between the cells in the cluster and all other cells. The cell cluster identities were determined by top-ranked DEGs and previously reported biologically related genes. All single-cell RNA-seq data that support the finding of this study has been deposited at SequenceRead Archive (SRA) under accession number PRJNA1165172.