4.2.1. Merging fastq files from individual lanes and/or libraries (Optional)
4.2.1.1 Depending on the type of instrument used for sequencing, one or multiple R1/R2 fastq files per
library may result from individual lanes of a flow cell. The fastq files from individual lanes should
be merged into single R1.fastq and single R2.fastq files to simplify the following steps. This is an
example of fastq files obtained from HiSeq 4 lane sequencing:
> mylibrary_L001_R1.fastq.gz, mylibrary_L002_R1.fastq.gz,
mylibrary_L003_R1.fastq.gz, mylibrary_L004_R1.fastq.gz
> mylibrary_L001_R2.fastq.gz, mylibrary_L002_R2.fastq.gz,
mylibrary_L003_R2.fastq.gz, mylibrary_L004_R2.fastq.gz
4.2.1.2 To merge the fastq files from different lanes use a cat command in a terminal. This will generate
two files: mylibrary_R1.fastq.gz and mylibrary_R2.fastq.gz, containing the information of the
> cat mylibrary_L001_R1.fastq.gz mylibrary_L002_R1.fastq.gz
mylibrary_L003_R1.fastq.gz mylibrary_L004_R1.fastq.gz >
> cat mylibrary_L001_R2.fastq.gz mylibrary_L002_R2.fastq.gz
mylibrary_L003_R2.fastq.gz mylibrary_L004_R2.fastq.gz >
4.2.1.3 Move these 2 fastq files into a new folder, which will be referenced in this manual as $fastqfolder.
NOTE: This step can also be done if you sequenced your library in multiple sequencing runs.
Warning: The order of merging files should be kept the same (for e.g., L001, L002, L003, L004, not L002,
L001 ...) to avoid issues when demultiplexing the samples.
4.2.2. Sequencing data quality check
4.2.2.1 Run fastQC on both R1 and R2 fastq files. Use –outdir option to indicate the path to the output
directory. This directory will contain HTML reports produced by the software.
> fastqc –-outdir $QCdir/ mylibrary_R1.fastq.gz
> fastqc –-outdir $QCdir/ mylibrary_R2.fastq.gz
4.2.2.2 Check fastQC reports to assess the quality of the samples (see Software and materials).
• The report for the R1 fastq file may contain some "red flags" because it contains barcodes/UMIs. Still,
it can provide useful information on the sequencing quality of the barcodes/UMIs.
• The main point of this step is to check the R2 fastq report. Of note, per base sequence content and
kmer content are rarely green. If there is some adapter contamination or overrepresented sequence
detected in the data, it may not be an issue (if the effect is limited to <10~20%). These are lost reads
but most of them will be filtered out during the next step.
4.2.3. Preparing the reference genome
The fastq files must be aligned (or “mapped”) on a reference genome. The STAR (Dobin et al., 20131)
aligner is one of the most efficient tools for RNA-seq reads mapping. It contains a “soft-clipping” tool that
automatically cuts the beginning or the end of reads to improve the mapping efficiency, thus allowing the
user to skip the step of trimming the reads for adapter contamination. Moreover, STAR has a mode called
STARsolo, designed to align multiplexed data (such as DRUG-seq) and directly generate count matrices.
The STAR aligner requires a genome assembly together with a genome index file. The index file
generation is a time-consuming process that is only performed once on a given genome assembly so that
it can be completed in advance and the index files can be stored on the server for subsequent analyses.
4.2.3.1 Download the correct genome assembly fasta file (e.g.,
Homo_sapiens.GRCh38.dna.primary_assembly.fa) and gene annotation file in gtf format (e.g.,
Homo_sapiens.GRCh38.108.gtf) from Ensembl or UCSC repository. Below is an example of a
> wget https://ftp.ensembl.org/pub/release-
108/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
> gzip -d Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz # unzip
> wget https://ftp.ensembl.org/pub/release-
108/gtf/homo_sapiens/Homo_sapiens.GRCh38.108.gtf.gz
> gzip -d Homo_sapiens.GRCh38.108.gtf.gz # unzip
NOTE: It’s recommended to download the primary_assembly fasta file when possible (without the ‘sm’ or
‘rm’ tags). If not available, download the top_level assembly. For the gtf, download the one that does not
have the ‘chr’ or ‘abinitio’ tags.
4.2.3.2 Use STAR to create an index for the genome assembly. Indicate the output folder name
containing the index files using --genomeDir option:
> STAR --runMode genomeGenerate --genomeDir /path/to/genomeDir --
genomeFastaFiles Homo_sapiens.GRCh38.dna.primary_assembly.fa --sjdbGTFfile
Homo_sapiens.GRCh38.108.gtf --runThreadN 8
• The --runThreadN parameter can be modified depending on the number of cores available on your
machine. The larger this number is, the more parallelized/fast the indexing will be.
• STAR can use up to 32-40Gb of RAM depending on the genome assembly. So, you should use a
machine that has this RAM capacity.
4.2.4. Aligning to the reference genome and generation of count matrices
After the genome index is created, both R1 and R2 fastq files can be aligned to this reference genome.
For this step, use the “solo” mode of STAR, which not only aligns the reads to the reference genome but
also creates the gene read count and UMI (unique molecular identifier) count matrices.
The following parameters should be adjusted according to the sequencing information:
• --soloCBwhitelist: a text file with the list of barcodes (one barcode sequence per lane) which is used
by STAR for demultiplexing. This file is provided according to version of the MERCURIUS kit used.
Example of “barcodes_96_V5D_star.txt”:
• --soloCBstart: Start position of the barcode in the R1 fastq file, equal to 1.
• --soloCBlen: Length of the barcode. This value should match the length of the barcode sequence in
the file specified by –soloCBwhitelist. The barcode length depends on the version of the oligo-dT
barcodes provided in the kit. For the barcode plate set V5, the default value is 14.
• --soloUMIstart: Start position of the UMI, it’s soloCBlen + 1 since the UMI starts right after the barcode
• --soloUMIlen: The length of UMI. This parameter depends on the version of the oligo-dT barcodes in
the kit and the number of sequencing cycles performed for Read1. For the barcode plate set V5 the
• --readFilesIn: name and path to the input fastq files.
The order of the fastq files provided in the script is important. The first fastq must contain genomic
information, while the second the barcode and UMI content. Thus, files should be provided for STARsolo
in the following order: --readFilesIn mylibray_R2 mylibrary_R1.
This step will output bam files and count matrices in the folder $bamdir.
• --genomeDir: a path to the genome indices directory generated before ($genomeDir).
Output count matrix parameters:
By default, STARsolo produces a UMI count matrix, i.e., containing unique non-duplicated reads per
sample for each gene. This type of count data is a standard for single-cell RNA-seq analysis. For bulk
RNA-seq analysis, a gene read count matrix is usually used. The following parameters will enable the
generation of the output of interest.
--soloUMIdedup NoDedup, will generate a read count matrix output
--soloUMIdedup NoDedup 1MM_Directional, will generate both UMI and read count matrices in mtx
> STAR --runMode alignReads --outSAMmapqUnique 60 --runThreadN 8 --
outSAMunmapped Within --soloStrand Forward --quantMode GeneCounts --
outBAMsortingThreadN 8 --genomeDir $genomeDir --soloType CB_UMI_Simple --
soloCBstart 1 --soloCBlen 14 --soloUMIstart 15 --soloUMIlen 14 --
soloUMIdedup NoDedup 1MM_Directional --soloCellFilter None --soloCBwhitelist
barcodes.txt --soloBarcodeReadLength 0 --soloFeatures Gene --
outSAMattributes NH HI nM AS CR UR CB UB GX GN sS sQ sM --
outFilterMultimapNmax 1 --readFilesCommand zcat --outSAMtype BAM
SortedByCoordinate --outFileNamePrefix $bamdir --readFilesIn
mylibrary_R2.fastq.gz mylibrary_R1.fastq.gz
This step will output bam files and count matrices in the folder $bamdir.
The demultiplexing statistics can be found in the “bamdir/Solo.out/Barcodes.stats” file.
The alignment quality and performance metrics can be found in the “bamdir/Log.final.out” file.
NOTE: The most important statistic at this step is the proportion of “Uniquely mapped reads” which is
expected to be greater than 70% (for human, mouse or drosophila).
4.2.5. Generating the count matrix from .mtx file
STARsolo will generate a count matrix (matrix.mtx file) located in the bamdir/Solo.out/Gene/raw folder.
This file is a sparse matrix format that can be transformed into a standard count matrix using an R script
> matrix_dir <- "$bamdir/Solo.out/Gene/raw"
> f <- file(paste0(matrix_dir, "matrix.mtx"), "r")
> mat <- as.data.frame(as.matrix(readMM(f)))
> feature.names = fread(paste0(matrix_dir, "features.tsv"), header = FALSE,
stringsAsFactors = FALSE, data.table = F)
> barcode.names = fread(paste0(matrix_dir, "barcodes.tsv"), header = FALSE,
stringsAsFactors = FALSE, data.table = F)
> colnames(mat) <- barcode.names$V1
> rownames(mat) <- feature.names$V1
> fwrite(mat, file = umi.counts.txt, sep = "\t", quote = F, row.names = T,
The resulting UMI/gene count matrix can be used for a standard expression analysis following
conventional bioinformatic tools.
4.2.6. Generating the read count matrix with per-sample stats (Optional)
Given a multiplex BAM file obtained with STARsolo and a set of barcodes, the software FastReadCounter
produces a read count matrix with per-sample statistics with the following code:
> gtf_file=homo_sapience.gtf ### GTF genome annotation file
> output_folder=counts/ ### Name of the final count output file
> bam_dir=mypath/bam_demult ### Directory with demultiplexed bam
> barcode_file=V5D_96_frc.txt ### Barcode reference file
> FastReadCounter-1.0.jar" --bam ${bam_dir}/${bam_dir}.bam \
> --barcodeFile \${barcode_file} \
The resulting read count matrices can be used for subsequent gene expression analysis using established
NOTE: Please contact us at [email protected] in case you don’t have the barcode sequences (in your email, please indicate the name of the barcode set and the PN of the barcode module).
4.2.7. Demultiplexing bam files (Optional)
Generation of demultiplexed bam files, i.e., individual bam files for each sample, might be needed in some
cases, for example, for submitting the raw data to an online repository that does not accept multiplexed
data (for example, GEO or ArrayExpress), or for running an established bulk RNA-seq data analysis
For this purpose, the Picard tool can be used with the following parameters:
• $out_dir, The output directory for demultiplexed bam files
• $path_to_bam, the path to multiplexed single bam file
• $barcode_brb.txt, tab-delimited file containing 2 columns: sample_id and barcode seq. Example of
> Sample4 ACGAGCAGATGCAG$
NOTE: This file is different from the list of barcode files provided to STAR.
Run the following Picard script:
> demultiplexed_bam_out_dir=$out_dir
> barcode_info=$barcode_brb.txt
> while IFS=$'\t' read -r -a line
> java -jar /path/to/picard.jar FilterSamReads \
> O=${demultiplexed_bam_out_dir}/${sample_id}.bam \
> TAG=CR TAG_VALUE=${tag_value} \
> FILTER=includeTagValues
NOTE: Please contact us at [email protected] in case you don’t have the barcode sequences (in your email, please indicate the name of the barcode set and the PN of the barcode module).