Differential Gene Expression Analysis of TCGA Cancer Transcriptomes with DESeq2 and Pathway Enrichment

Hussain Zubair

Mar 27, 2026

Differential Gene Expression Analysis of TCGA Cancer Transcriptomes with DESeq2 and Pathway Enrichment

DOI

https://dx.doi.org/10.17504/protocols.io.kqdg3mn6el25/v1

Hussain Zubair¹

¹zzu

Hussain Zubair

zzu

DOI: https://dx.doi.org/10.17504/protocols.io.kqdg3mn6el25/v1

Protocol Citation: Hussain Zubair 2026. Differential Gene Expression Analysis of TCGA Cancer Transcriptomes with DESeq2 and Pathway Enrichment. protocols.io https://dx.doi.org/10.17504/protocols.io.kqdg3mn6el25/v1

License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited

Protocol status: Working

We use this protocol and it's working

Created: March 27, 2026

Last Modified: March 27, 2026

Protocol Integer ID: 313992

Keywords: Differential expression, RNA-seq, DESeq2, TCGA, TCGAbiolinks, GSEA, ORA, GSVA, clusterProfiler, cancer genomics, Bioconductor, differential gene expression analysis of tcga cancer transcriptome, tcga cancer transcriptome, differential gene expression analysis, tcga rna, validated workflow for differential gene expression analysis, data retrieval via tcgabiolink, tcgabiolink, gene list, pathway enrichment this protocol, gene id conversion, tcga, pathway analysis approach, pathway enrichment, seq data, differential expression, level scoring with gsva

Abstract

This protocol provides a validated workflow for differential gene expression analysis of TCGA RNA-seq data. It covers data retrieval via TCGAbiolinks, normalization decisions, differential expression with DESeq2 and apeglm shrinkage, gene ID conversion, and three pathway analysis approaches: ORA, GSEA, and sample-level scoring with GSVA. Each step includes benchmark values validated against TCGA-LUAD (n=540 tumor, 59 normal). The protocol addresses common pitfalls including inappropriate use of TPM for statistical testing and pre-filtering of gene lists before GSEA.

Materials

Required R/Bioconductor packages (minimum tested versions):

- TCGAbiolinks v2.38.0+: TCGA/GDC data retrieval
- DESeq2 v1.50.0+: Differential expression (negative binomial)
- clusterProfiler v4.18.0+: ORA and GSEA
- GSVA v2.0.0+: Sample-level pathway scoring
- msigdbr v7.5.1+: MSigDB gene sets
- org.Hs.eg.db v3.20.0: Gene annotation

Safety warnings

⚠️ CAUTION**: Never use TPM/FPKM as input to DESeq2 or edgeR. These tools require raw, unnormalized counts and perform internal normalization.
⚠️ CAUTION**: Verify reference level with levels(dds$sample_type). The first level is your reference. Swapped labels invert all fold changes.
⚠️ CAUTION**: Failing to strip version suffixes (ENSG00000141510.18 -3e ENSG00000141510) causes silent mapping failure. All results will be NA.
⚠️ CAUTION**: Pre-filter the gene list to significant genes before GSEA. The algorithm requires the full ranked list to work correctly.

Before start

Prerequisites:

1. R 3e= 4.4.0 with Bioconductor 3.22
2. At least 16 GB RAM (32 GB recommended)
3. ~20 GB free disk space for TCGA downloads
4. Stable internet connection

⚠️ CAUTION**: If GDCdownload() fails mid-download, delete the partial directory and restart. Do not resume partial downloads.

Procedure

Retrieve TCGA expression data. Download raw counts from the GDC. The returned SummarizedExperiment contains both raw counts (for DE) and TPM (for visualization only).
library(TCGAbiolinks)

query 3c- GDCquery(
  project = "TCGA-LUAD",
  data.category = "Transcriptome Profiling",
  data.type = "Gene Expression Quantification",
  workflow.type = "STAR - Counts"
)
GDCdownload(query, directory = "GDCdata")
se 3c- GDCprepare(query, directory = "GDCdata")
clinical 3c- as.data.frame(colData(se))
_EXPECTED_: ~540 tumor and ~59 normal samples for TCGA-LUAD.
⚠️ CAUTION_: Never use TPM/FPKM as input to DESeq2 or edgeR. These tools require raw, unnormalized counts and perform internal normalization.

Normalization decision. Raw counts go directly to DESeq2 (median-of-ratios normalization internally). For visualization, use vst() after fitting. RSEM expected counts must be rounded with round() before DESeq2. If your data is TPM/FPKM, go back and obtain raw counts.

Differential expression with DESeq2. DESeq2 models count data with a negative binomial distribution. Pre-filtering removes lowly expressed genes to reduce multiple testing burden.
library(DESeq2)
set.seed(42)

counts_mat 3c- assay(se, "unstranded")
col_data 3c- colData(se)
stopifnot(identical(colnames(counts_mat), rownames(col_data)))

dds 3c- DESeqDataSetFromMatrix(
countData = counts_mat, colData = col_data,
design = ~ sample_type
)

keep 3c- rowSums(counts(dds) = 10) = 5
dds 3c- dds[keep, ]
dds$sample_type 3c- relevel(factor(dds$sample_type), ref = "Solid Tissue Normal")
dds 3c- DESeq(dds)

res 3c- results(dds,
name = "sample_type_Primary.Tumor_vs_Solid.Tissue.Normal", alpha = 0.05)
res_shrunk 3c- lfcShrink(dds,
coef = "sample_type_Primary.Tumor_vs_Solid.Tissue.Normal", type = "apeglm")

sig 3c- subset(as.data.frame(res_shrunk), padj  0.05  abs(log2FoldChange)  1)
sig 3c- sig[order(sig$padj), ]
_EXPECTED_: 3,000-5,000 DEGs at padj  0.05 and |LFC|  1.
⚠️ CAUTION_: Verify reference level with levels(dds$sample_type). The first level is your reference. Swapped labels invert all fold changes.

_Note_: For batch effects, use design = ~ batch + sample_type. Do not apply ComBat before DESeq2.

Gene ID conversion. TCGA uses Ensembl IDs; pathway tools need HGNC symbols. Strip version suffixes first.
library(org.Hs.eg.db)

gene_symbols 3c- mapIds(org.Hs.eg.db,
  keys = sub("\\..*", "", rownames(res)),
  keytype = "ENSEMBL", column = "SYMBOL", multiVals = "first")
res$symbol 3c- gene_symbols[sub("\\..*", "", rownames(res))]
res 3c- res[!is.na(res$symbol)  !duplicated(res$symbol), ]
⚠️ CAUTION_: Failing to strip version suffixes (ENSG00000141510.18 - ENSG00000141510) causes silent mapping failure. All results will be NA.

Over-Representation Analysis (ORA). ORA tests whether significant genes are enriched in specific pathways using the hypergeometric test.
library(clusterProfiler)

sig_genes 3c- res$symbol[res$padj  0.05  abs(res$log2FoldChange)  1]
ego 3c- enrichGO(gene = sig_genes, OrgDb = org.Hs.eg.db, keyType = "SYMBOL",
  ont = "BP", pAdjustMethod = "BH", pvalueCutoff = 0.05)

Gene Set Enrichment Analysis (GSEA). Unlike ORA, GSEA uses the complete ranked gene list without any cutoff, making it sensitive to coordinated small changes across a pathway.
gene_list 3c- res_shrunk$log2FoldChange
names(gene_list) 3c- res_shrunk$symbol
gene_list 3c- sort(gene_list[!is.na(names(gene_list))], decreasing = TRUE)
gene_list 3c- gene_list[!duplicated(names(gene_list))]
gsea_go 3c- gseGO(geneList = gene_list, OrgDb = org.Hs.eg.db,
  keyType = "SYMBOL", ont = "BP", minGSSize = 15, maxGSSize = 500,
  pvalueCutoff = 0.05, eps = 0, seed = TRUE)
⚠️ CAUTION_: Never pre-filter the gene list to significant genes before GSEA. The algorithm requires the full ranked list to work correctly.

Sample-level pathway scoring (GSVA). GSVA assigns a pathway activity score per sample, enabling patient-level pathway analysis and subtype discovery.
library(GSVA); library(msigdbr)

h_sets 3c- msigdbr(species = "Homo sapiens", category = "H")
h_list 3c- split(h_sets$gene_symbol, h_sets$gs_name)
expr_mat 3c- assay(vst(dds, blind = FALSE))
gsva_scores 3c- gsva(gsvaParam(expr_mat, h_list), kcdf = "Gaussian")
_EXPECTED_: 50 Hallmark pathways x n_samples matrix. Use kcdf='Gaussian' for continuous data.

Troubleshooting

 100 DEGs: Check design formula and reference level. Verify labels with PCA on vst data.
 10,000 DEGs: Likely batch effects or purity confounding. Add batch to design formula.
Gene mapping returns NA: Strip Ensembl version suffixes: sub('\\..*', '', ids).
GSEA returns nothing: Verify list is sorted decreasingly and not pre-filtered.

Protocol references

Love MI et al. DESeq2. Genome Biology. 2014;15(12):550.
Colaprico A et al. TCGAbiolinks. Nucleic Acids Res. 2016;44(8):e71.
Subramanian A et al. GSEA. PNAS. 2005;102(43):15545-15550.
Hanzelmann S et al. GSVA. BMC Bioinformatics. 2013;14:7.
TCGA Network. Lung adenocarcinoma. Nature. 2014;511:543-550.