Aug 12, 2020

# CUT&Tag Data Processing and Analysis Tutorial

• Fred Hutchinson Cancer Research Center
Abstract
This tutorial is designed for processing and analyzing CUT&Tag data following the Benchtop CUT&Tag V.3 protocol. The illustration data used in this tutorial is the profiling of histone modifications in the human lymphoma K562 cell line, but the tutorial is generally applicable to any chromatin protein, including transcription factors, RNA polymerase II, and epitope-tagged proteins. For reproducible analysis, this tutorial is also available on GitHub at https://yezhengstat.github.io/CUTTag_tutorial/.
Protocol Citation
Ye Zheng, Kami Ahmad, Steven Henikoff 2020. CUT&Tag Data Processing and Analysis Tutorial. protocols.io. https://dx.doi.org/10.17504/protocols.io.bjk2kkye
MANUSCRIPT CITATION: please remember to cite the following publication along with this protocol
Henikoff S, Henikoff JG, Kaya-Okur HS, Ahmad K, Efficient chromatin accessibility mapping in situ by nucleosome-tethered tagmentation. eLife doi: 10.7554/eLife.63274
Keywords
CUT&Tag, Data Processing, Analysis, Quality Control
This is an open access protocol distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Created
Aug 11, 2020
Aug 19, 2020
PROTOCOL integer ID
40314
Parent protocols
Part of collection
2020 Featured Protocols

I. Introduction

1
Overview of CUT&Tag

All dynamic processes that take place on DNA in the eukaryotic nucleus occur in the context of a chromatin landscape that comprises nucleosomes and their modifications, transcription factors, and chromatin-associated complexes. A variety of chromatin features mark sites of activating and silencing transcriptional regulatory elements and chromatin domains that differ between cell types and change during development.

The mapping of chromatin features genome-wide has traditionally been performed using chromatin immunoprecipitation (ChIP), in which chromatin is cross-linked and solubilized, and an antibody to a protein or modification of interest is used to immunoprecipitate the bound DNA (Fig. 1a). Very little has changed in the way ChIP is most generally performed since it was first described 35 years ago, and it remains fraught with signal-to-noise issues and artifacts. An alternative chromatin profiling strategy is enzyme tethering in situ, whereby the chromatin protein or modification of interest is targeted by an antibody or fusion protein and the underlying DNA is then marked or cleaved. A succession of enzyme-tethering methods has been introduced over the past two decades. Cleavage Under Targets & Tagmentation (CUT&Tag) is a tethering method that uses a protein-A-Tn5 (pA-Tn5) transposome fusion protein (Fig. 1b). In CUT&Tag, permeabilized cells or nuclei are incubated with antibody to a specified chromatin protein, and then pA-Tn5 loaded with mosaic end adaptors is successively tethered to antibody-bound sites. Activation of the transposome by adding magnesium ions results in the integration of the adaptors into nearby DNA. The tagmented fragments are then amplified to generate sequencing libraries. Antibody-tethered Tn5-based methods achieve high sensitivity owing to stringent washing of samples after pA-Tn5 tethering and the high efficiency of adaptor integration. The improved signal-to-noise relative to ChIP-seq translates to an order-of-magnitude reduction in the amount of sequencing required to map chromatin features, allowing sample pooling (typically up to 90 samples) for paired-end sequencing on Illumina NGS sequencers by barcoded PCR of libraries.

Figure 1. Differences between immunoprecipitation and antibody-targeted chromatin profiling strategies. A. ChIP-seq experimental procedure. B. CUT&Tag experimental procedure. Cells and nuclei are indicated in grey, chromatin as red nucleosomes, and a specific chromatin protein in green.

2
Objectives

This tutorial is designed for processing and analyzing CUT&Tag data following the Benchtop CUT&Tag V.3 protocol. The illustration data used in this tutorial is the profiling of histone modifications in the human lymphoma K562 cell line, but the tutorial is generally applicable to any chromatin protein, including transcription factors, RNA polymerase II, and epitope-tagged proteins.
3
CUT&Tag data processing and analysis outline
Figure 2. CUT&Tag data processing and analysis.

4
Requirements

• Linux system
• R (versions >= 3.6)
- dplyr
- stringr
- ggplot2
- viridis
- GenomicRanges
- chromVAR
- DESeq2
- ggpubr
- corrplot
- ChIPseqSpikeInFree [Optional]

• FastQC (version >= 0.11.9) [Optional]
• Bowtie2 (version >= 2.3.4.3)
• samtools (version >= 1.10)
• bedtools (version >= 2.29.1)
• Picard (version >= 2.18.29)
• SEACR (version >= 1.3)
• deepTools (version >= 2.0)
5

In this tutorial, we use data from Kaya-Okur et al. (2020), which are available for download from GEO. The corresponding SRA entries are provided below.

- a. Using SRA Toolkit

- b. Download through European Nucleotide Archive. New ENA Browser: https://www.ebi.ac.uk/ena/browser/view. We are using this option as an illustration.

Data accession on GEO:

• H3K27me3:
- SH_Hs_K27m3_NX_0918 as replicate 1: GEO accession: GSE145187, SRA entry: SRX8754646
- SH_Hs_K27m3_Xpc_0107 as replicate 2: GEO accession: GSE145187, SRA entry: SRX7713678

• H3K4me3:
- SH_Hs_K4m3_NX_0918 as replicate 1: GEO accession: GSE145187, SRA entry: SRX7713692
- SH_Hs_K4m3_Xpc_0107 as replicate 2: GEO accession: GSE145187, SRA entry: SRX7713696

• IgG:
- SH_Hs_IgG_1x_0924 as replicate 1: GEO accession: GSE145187, SRA entry: SRX8468909
- SH_Hs_IgG_20181224 as replicate 2: GEO accession: GSM3680227, SRA entry: SRX5545346

First, we need to specify the project path.

Taking SH_Hs_IgG_20181224 as an example.

II. Data Pre-processing

6
Quality Control using FastQC [Optional]

This step is not required. For users who generate their own data and whose groups run FastQC as a routine check, we provide this step as a troubleshooting guide.

1. Obtain FastQC

2. Run FastQC for quality check

3. Interpret the quality check results.

The discordant sequence content at the beginning of the reads is a common phenomenon for CUT&Tag reads. Failing to pass the Per base sequence content does not mean your data failed.
Figure 3. Per base sequence content fails the FastQC quality check.
- It can be due to the Tn5 preference.

- What you might be detecting is the 10-bp periodicity that shows up as a sawtooth pattern in the length distribution. If so, this is normal and will not affect alignment or peak calling. In any case, we do not recommend trimming as the bowtie2 parameters that we list will give accurate mapping information without trimming.
7
Merge technical replicates/lanes if needed [Optional]

Samples are often sequenced across multiple lanes for efficiency, and the resulting files can be pooled before alignment. If you want to check the reproducibility between different lanes of the same sample, you can skip this step and align each sequencing file (fastq file) separately.
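As an illustration of the pooling described above, the following is a minimal Python sketch that concatenates per-lane FASTQ files into one file. The function name and any file names are hypothetical. It relies on the fact that gzip-compressed files can be joined byte for byte, so compressed lanes do not need to be decompressed first.

```python
import shutil
from pathlib import Path


def merge_lanes(lane_files, merged_path):
    """Concatenate per-lane FASTQ files (plain or gzip) into a single file.

    A gzip file may contain several compressed members back to back, so
    compressed lanes can be appended without decompressing them.
    """
    with open(merged_path, "wb") as merged:
        for lane in lane_files:
            with open(lane, "rb") as source:
                shutil.copyfileobj(source, merged)
    return Path(merged_path)
```

For paired-end data, merge the R1 files and the R2 files separately and list the lanes in the same order for both, so that mates stay in matching positions.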

III. Alignment

8

The structure of CUT&Tag insert libraries with Tn5 adapters and barcoded PCR primers is shown below:
Figure 4. CUT&Tag insert libraries with the sequence of adapters.
Our standard pipeline is to perform single-index 25x25 PE Illumina sequencing on up to 90 pooled samples on a single HiSeq 2500 flowcell, where each sample has a unique PCR primer barcode. Amounts for each library are adjusted to provide ~5 million paired-end reads, which provides high-quality profiling for abundant chromatin features with a specific and high-yield antibody. Less abundant features typically require fewer reads, while lower-quality antibodies may increase the number of reads needed for generating robust chromatin profiles. A thorough discussion of feature recall and sequencing depths for CUT&Tag has been published (Kaya-Okur et al 2020).
9
Bowtie2 alignment

Alignment to hg38.

The paired-end reads are aligned by Bowtie2 using parameters chosen for mapping of inserts 10-700 bp in length.

Critical step: There is no need to trim reads from our standard 25x25 PE sequencing, as adapter sequences will not be included in reads of inserts >25 bp. However, for users performing longer sequencing, reads will need to be trimmed by Cutadapt and mapped with parameters that ignore any remaining adapter sequence at the 3' ends of reads during mapping.
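The exact parameter string is not shown above, so the following Python sketch only illustrates how such a Bowtie2 call could be assembled. The insert-size limits (-I 10, -X 700) follow from the 10-700 bp range stated in the text. The end-to-end preset, the concordant-pairs-only flags, the core count, and all file names are assumptions made for illustration.

```python
import shlex


def bowtie2_command(index_prefix, fastq_r1, fastq_r2, out_sam,
                    cores=8, min_insert=10, max_insert=700):
    """Build a Bowtie2 argument list for paired-end alignment."""
    return [
        "bowtie2",
        "--end-to-end", "--very-sensitive",  # assumed preset for untrimmed short reads
        "--no-mixed", "--no-discordant",     # assumed: report concordant pairs only
        "--phred33",
        "-I", str(min_insert),               # minimum insert length in bp
        "-X", str(max_insert),               # maximum insert length in bp
        "-p", str(cores),
        "-x", index_prefix,
        "-1", fastq_r1,
        "-2", fastq_r2,
        "-S", out_sam,
    ]


# Hypothetical file names; print the command instead of running it.
cmd = bowtie2_command("hg38_index", "sample_R1.fastq.gz",
                      "sample_R2.fastq.gz", "sample_bowtie2.sam")
print(shlex.join(cmd))
```

Passing the list to subprocess.run(cmd, check=True) would execute the alignment on a system where Bowtie2 and the index are installed.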
10
Alignment to spike-in genome for spike-in calibration [optional/recommended]

This step is optional but recommended depending on your experimental protocol.

E. coli DNA is carried along with bacterially-produced pA-Tn5 protein and gets tagmented non-specifically during the reaction. The fraction of total reads that map to the E. coli genome depends on the yield of epitope-targeted CUT&Tag, and so depends on the number of cells used and the abundance of that epitope in chromatin. Since a constant amount of pA-Tn5 is added to CUT&Tag reactions and brings along a fixed amount of E. coli DNA, E. coli reads can be used to normalize epitope abundance in a set of experiments. For more discussion, please see Section V.

- For spike-in normalization, reads are aligned to the E. coli genome U00096.3 with two more parameters, --no-overlap and --no-dovetail, to avoid possible cross-mapping of the experimental genome to that of the carry-over E. coli DNA that is used for calibration.

11
Alignment summary

For a more detailed explanation of the parameters, users can refer to the Bowtie2 manual.

The Bowtie2 alignment summary is saved to a summary file, and you should expect results similar to those interpreted below.

- 2984630 is the sequencing depth, i.e., the total number of paired reads.
- 125110 is the number of read-pairs that fail to be mapped.
- 2360430 + 499090 is the number of read-pairs that are successfully mapped.
- 95.81% is the overall alignment rate
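The relationship between these numbers can be checked with a few lines of Python. This is a minimal sketch that assumes every read pair falls into exactly one of three categories (not aligned, aligned exactly once, aligned more than once). The function and key names are our own.

```python
def alignment_summary(unaligned_pairs, aligned_once, aligned_multiple):
    """Derive sequencing depth and overall alignment rate from pair counts."""
    mapped_pairs = aligned_once + aligned_multiple
    total_pairs = unaligned_pairs + mapped_pairs
    return {
        "total_pairs": total_pairs,
        "mapped_pairs": mapped_pairs,
        "alignment_rate_percent": round(100 * mapped_pairs / total_pairs, 2),
    }


# Counts taken from the example summary discussed above.
summary = alignment_summary(125110, 2360430, 499090)
print(summary)
```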

III. Alignment Summary

12
Report sequencing mapping summary

Summarize the raw reads and uniquely mapping reads to report the efficiency of alignment. Alignment frequencies are expected to be >80% for high-quality data. CUT&Tag data typically has very low backgrounds, so as few as 1 million mapped fragments can give robust profiles for a histone modification in the human genome. Profiling of less-abundant transcription factors and chromatin proteins may require 10 times as many mapped fragments for downstream analysis.

We can evaluate the following metrics:

- Sequencing depth
- Alignment rate
- Number of mappable fragments
- Duplication rate
- Unique library size
- Fragment size distribution
12.1
1. Sequencing depth

2. Spike-in alignment

3. Summarize the alignment to hg38 and E.coli

4. Visualizing the sequencing depth and alignment results.

In a typical CUT&Tag experiment targeting the abundant H3K27me3 histone modification in 65,000 K562 cells, the percentage of E. coli reads ranges from ~0.01% to 10%. With fewer cells or less abundant epitopes, E. coli reads can comprise as much as 70% of the total mapped reads. For IgG controls, the percentage of E. coli reads is typically much higher than that for an abundant histone modification.

12.2
Remove duplicates [optional]

CUT&Tag integrates adapters into DNA in the vicinity of the antibody-tethered pA-Tn5, and the exact sites of integration are affected by the accessibility of surrounding DNA. For this reason, fragments that share exact starting and ending positions are expected to be common, and such ‘duplicates’ may not be due to duplication during PCR. In practice, we have found that the apparent duplication rate is low for high-quality CUT&Tag datasets, and even the apparent ‘duplicate’ fragments are likely to be true fragments. Thus, we do not recommend removing the duplicates. In experiments with very small amounts of material or where PCR duplication is suspected, duplicates can be removed. The following commands show how to check the duplication rate using Picard.

We summarize the apparent duplication rate and calculate the unique library size without duplicates.

- In these example datasets, the IgG control samples have relatively high duplication rates, since reads in this sample derive from non-specific tagmentation in the CUT&Tag reactions. Therefore, it is appropriate to remove the duplicates from the IgG datasets before downstream analysis.

- The estimated library size is the estimated number of unique molecules in the library based on PE duplication calculated by Picard.

- The estimated library sizes are proportional to the abundance of the targeted epitope and to the quality of the antibody used, while the estimated library sizes of IgG samples are expected to be very low.

- The unique fragment number is calculated by the MappedFragNum_hg38 * (1-DuplicationRate/100).
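The last formula can be written as a small helper. This is a minimal sketch in which the function name and the example numbers are hypothetical. It expects the duplication rate as a percentage, so a rate reported as a fraction between 0 and 1 should be multiplied by 100 first.

```python
def unique_fragment_count(mapped_fragments, duplication_rate_percent):
    """Apply MappedFragNum * (1 - DuplicationRate / 100), rounded to an integer."""
    if not 0 <= duplication_rate_percent <= 100:
        raise ValueError("duplication rate must be given in percent (0-100)")
    return round(mapped_fragments * (1 - duplication_rate_percent / 100))


# Hypothetical sample: 2,000,000 mapped fragments with a 5% duplication rate.
print(unique_fragment_count(2_000_000, 5.0))
```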

12.3
Assess mapped fragment size distribution

CUT&Tag inserts adapters on either side of chromatin particles in the vicinity of the tethered enzyme, although tagmentation within chromatin particles can also occur. CUT&Tag reactions targeting a histone modification therefore predominantly result in fragments of nucleosomal length (~180 bp) or multiples of that length. CUT&Tag targeting transcription factors predominantly produces nucleosome-sized fragments and variable amounts of shorter fragments, from neighboring nucleosomes and the factor-bound site, respectively. Tagmentation of DNA on the surface of nucleosomes also occurs, and plotting fragment lengths with single-basepair resolution reveals a 10-bp sawtooth periodicity, which is typical of successful CUT&Tag experiments.

- The smaller fragments (50-100 bp) can arise because tethered Tn5 can tagment on the surface of a nucleosome as well as in linker regions, so the small fragments are not necessarily background.
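One way to obtain the fragment length distribution is to tabulate the template length field (column 9) of the aligned SAM records. The sketch below is a minimal Python illustration under two assumptions: the input is uncompressed SAM text, and each aligned pair contributes two records, so counts are halved. The function name is hypothetical.

```python
from collections import Counter


def fragment_length_counts(sam_lines):
    """Count fragments per length from the TLEN field of SAM records."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):   # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & 0x4:             # skip unmapped reads
            continue
        counts[abs(int(fields[8]))] += 1
    # Both mates report the same absolute TLEN, so halve the record counts.
    return {length: n / 2 for length, n in sorted(counts.items())}
```

The resulting length-to-count mapping can be plotted at single-basepair resolution to look for the nucleosomal peaks and the 10-bp periodicity described above.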

12.4
Assess replicate reproducibility

Data reproducibility between replicates is assessed by correlation analysis of mapped read counts across the genome. For simplicity of implementation, we postpone this analysis until after Section IV, when the file format has been converted into fragment bed files.

IV. Alignment filtering and file format conversion

13
Filter mapped reads by mapping quality [optional]

Some projects may require more stringent filtering on the alignment quality score. This blog discusses in detail, with examples, how Bowtie assigns quality scores.

$$\mathrm{MAPQ}(x) = -10 \log_{10} P(x \text{ is mapped wrongly}) = -10 \log_{10}(p)$$

which ranges from 0 to 37, 40 or 42.

Filtering on this score will eliminate all alignments whose quality score is below the user-defined minQualityScore.

- If you do implement this filtering, please replace ${histName}_bowtie2.sam in the following steps with the filtered SAM file ${histName}_bowtie2.qualityScore$minQualityScore.sam.
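To make the score and the threshold concrete, the following minimal Python sketch converts between MAPQ and the mapping error probability and keeps only SAM records at or above a threshold. It assumes uncompressed SAM text with MAPQ in column 5. The function names are our own and do not come from any particular tool.

```python
import math


def mapq_from_error_probability(p):
    """MAPQ = -10 * log10(p), where p is the probability of a wrong mapping."""
    return -10 * math.log10(p)


def error_probability_from_mapq(mapq):
    """Invert the MAPQ definition to recover the error probability."""
    return 10 ** (-mapq / 10)


def filter_sam_by_mapq(sam_lines, min_quality_score):
    """Yield header lines and records whose MAPQ is >= min_quality_score."""
    for line in sam_lines:
        if line.startswith("@") or int(line.split("\t")[4]) >= min_quality_score:
            yield line
```

For example, a threshold of 2 keeps reads whose estimated probability of a wrong mapping is below roughly 0.63, while a threshold of 30 keeps only reads with an error probability of at most 0.001.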

14
File format conversion

This section is required in preparation for peak calling and visualization, which need a few filtering and file format conversion steps.

15
Assess replicate reproducibility (continue step 12.4)

To study the reproducibility between replicates and across conditions, the genome is split into 500 bp bins, and a Pearson correlation of the log2-transformed values of read counts in each bin is calculated between replicate datasets. Multiple replicates and IgG control datasets are displayed in a hierarchically clustered correlation matrix.
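The binning and correlation described above can be sketched in plain Python. This is a minimal illustration under stated assumptions: fragments are (chromosome, start, end) tuples read from the fragment bed files, each fragment is assigned to the 500 bp bin containing its midpoint, and only bins covered in both samples enter the correlation. All function names are hypothetical.

```python
import math
from collections import Counter

BIN_SIZE = 500  # bin width in bp, as stated in the text


def bin_fragment_counts(fragments, bin_size=BIN_SIZE):
    """Count fragments per genomic bin, assigning each by its midpoint."""
    counts = Counter()
    for chrom, start, end in fragments:
        midpoint = (start + end) // 2
        counts[(chrom, midpoint // bin_size)] += 1
    return counts


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def replicate_correlation(fragments_a, fragments_b, bin_size=BIN_SIZE):
    """Pearson correlation of log2 bin counts over bins present in both samples."""
    counts_a = bin_fragment_counts(fragments_a, bin_size)
    counts_b = bin_fragment_counts(fragments_b, bin_size)
    shared = sorted(counts_a.keys() & counts_b.keys())
    log_a = [math.log2(counts_a[key]) for key in shared]
    log_b = [math.log2(counts_b[key]) for key in shared]
    return pearson(log_a, log_b)
```

Computing this value for every pair of samples gives the matrix that is then clustered hierarchically.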

V. Spike-in calibration

16
Spike-in calibration

This section is optional but recommended depending on your experimental protocol. We have shown the alignment to the spike-in genome in step 10 and the spike-in alignment summary in step 12.

The underlying assumption is that the ratio of fragments mapped to the primary genome to the E. coli genome is the same for a series of samples, each using the same number of cells. Because of this assumption, we do not normalize between experiments or between batches of purified pA-Tn5, which can have very different amounts of carry-over E. coli DNA. Using a constant C to avoid small fractions in normalized data, we define a scaling factor S as

Normalized coverage is then calculated as:

The constant C is an arbitrary multiplier, typically 10,000. The resulting genomic coverage bedGraph file will be comparatively small.
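The two formulas are not reproduced in the text above, so the following Python sketch rests on an assumption: that the scaling factor is S = C / (number of fragments mapped to the E. coli genome) and that normalized coverage is the primary genome coverage multiplied by S. The function names and the example numbers are hypothetical.

```python
def spike_in_scale_factor(ecoli_fragments, constant=10_000):
    """Assumed form S = C / (fragments mapped to the E. coli genome)."""
    if ecoli_fragments <= 0:
        raise ValueError("spike-in calibration needs at least one E. coli fragment")
    return constant / ecoli_fragments


def normalize_coverage(coverage_values, scale_factor):
    """Assumed form: normalized coverage = primary genome coverage * S."""
    return [value * scale_factor for value in coverage_values]


# Hypothetical sample with 5,000 fragments mapped to E. coli.
scale = spike_in_scale_factor(5_000)
print(scale, normalize_coverage([1, 3, 10], scale))
```

Under this form, a sample with more carry-over E. coli fragments (that is, a lower yield from the primary genome) receives a smaller scaling factor.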

17
Scaling factor

VI. Peak calling

18
SEACR

The Sparse Enrichment Analysis for CUT&RUN (SEACR) package is designed to call peaks and enriched regions from chromatin profiling data with very low backgrounds (i.e., regions with no read coverage) that are typical for CUT&Tag chromatin profiling experiments. SEACR requires bedGraph files from paired-end sequencing as input and defines peaks as contiguous blocks of basepair coverage that do not overlap with blocks of background signal delineated in the IgG control dataset. SEACR is effective for calling both narrow peaks from factor binding sites and broad domains characteristic of some histone modifications. The method is described in Meers et al. (2019), and the user’s manual is available on GitHub. Since we have normalized fragment counts with the E. coli read count, we set the normalization option of SEACR to “non”. Otherwise, “norm” is recommended.

19
Number of peaks

20
Reproducibility of the peak across biological replicates

Peak calling on replicate datasets is compared to define reproducible peaks. The top 1% of peaks (ranked by total signal in each block) are selected as high-confidence sites.

The reproducibility is calculated by

$$\text{reproducibility} = \frac{\#\,\text{peaks overlapping rep1 and rep2}}{\#\,\text{peaks of rep1 or rep2}} \times 100$$

Therefore, it is sensitive to the total number of peaks called in each replicate.
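The calculation can be sketched as follows. This is a minimal Python illustration that assumes peaks are (chromosome, start, end) tuples in half-open BED coordinates and that the denominator is the number of peaks in the replicate being evaluated. The function names are hypothetical, and the simple pairwise comparison is meant for clarity rather than speed.

```python
def overlaps_any(peak, other_peaks):
    """True if the peak shares at least one base with any peak in other_peaks."""
    chrom, start, end = peak
    return any(c == chrom and s < end and start < e for c, s, e in other_peaks)


def peak_reproducibility(peaks_rep, peaks_other):
    """Percentage of peaks in one replicate that overlap a peak in the other."""
    if not peaks_rep:
        return 0.0
    n_overlapping = sum(overlaps_any(peak, peaks_other) for peak in peaks_rep)
    return 100 * n_overlapping / len(peaks_rep)


# Toy peak sets: replicate 1 has four peaks, replicate 2 has two.
rep1 = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200), ("chr2", 900, 950)]
rep2 = [("chr1", 150, 250), ("chr2", 50, 120)]
print(peak_reproducibility(rep1, rep2), peak_reproducibility(rep2, rep1))
```

In this toy example the same two overlaps give different percentages for the two replicates, which illustrates the sensitivity to the total number of peaks.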
21
FRagment proportion in Peaks regions (FRiPs)

We calculate the fraction of reads in peaks (FRiPs) as a measure of signal-to-noise and contrast it to FRiPs in the IgG control dataset for illustration. Although sequencing depths for CUT&Tag are typically only 1-5 million reads, the low background of the method results in high FRiP scores.
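A FRiP score can be sketched in Python as the percentage of fragments that overlap at least one called peak. This minimal example assumes fragments and peaks are (chromosome, start, end) tuples in half-open BED coordinates and that the denominator is the number of fragments passed in. The function name is hypothetical.

```python
from collections import defaultdict


def frip_percent(fragments, peaks):
    """Percentage of fragments that overlap at least one peak."""
    if not fragments:
        return 0.0
    peaks_by_chrom = defaultdict(list)
    for chrom, start, end in peaks:
        peaks_by_chrom[chrom].append((start, end))
    in_peaks = sum(
        1
        for chrom, start, end in fragments
        if any(s < end and start < e for s, e in peaks_by_chrom.get(chrom, ()))
    )
    return 100 * in_peaks / len(fragments)
```

Running the same function on the IgG control fragments against the same peak set gives the background value used for contrast.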

22
Visualization of peak number, peak width, peak reproducibility and FRiPs

VII. Visualization

23
Visualization

Typically, we are interested in visualizing the chromatin landscape of selected regions using a genome browser. The Integrative Genomics Viewer (IGV) provides a web app version and a local desktop version that are easy to use. The UCSC Genome Browser provides the most comprehensive supplementary genome information.

Browser display of normalized bedGraph files:
Figure 5. IGV Web Visualization around region chr7:131,000,000-134,000,000.

24
Heatmap visualization of specific regions

We are also interested in looking at chromatin features at a list of annotated sites, for example, histone modification signal at gene promoters. We will use the computeMatrix and plotHeatmap functions from deepTools to generate the heatmap.

25
Heatmap over transcription units

Figure 6. Heatmap of histone enrichment around genes.

26
Heatmap on CUT&Tag peaks

We use the midpoint of the signal block returned from SEACR to align signals in heatmaps. The sixth column of the SEACR output is an entry in the form chr:start-end that gives the first and last bases of the region with the maximum signal. We first generate a new bed file containing the midpoint information from column 6 and then use deepTools for the heatmap visualization.
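Parsing the sixth column can be sketched in Python as below. This minimal example assumes a tab-separated SEACR output line whose sixth field has the form chr:start-end. The function names and the example line are hypothetical.

```python
def summit_region(seacr_line):
    """Return (chrom, start, end) parsed from the chr:start-end entry in column 6."""
    fields = seacr_line.rstrip("\n").split("\t")
    chrom, span = fields[5].rsplit(":", 1)
    start, end = (int(value) for value in span.split("-"))
    return chrom, start, end


def summit_midpoint(seacr_line):
    """Return (chrom, midpoint) of the maximum-signal region."""
    chrom, start, end = summit_region(seacr_line)
    return chrom, (start + end) // 2


# Hypothetical SEACR output line.
line = "chr1\t1000\t2000\t350.5\t12.3\tchr1:1400-1500\n"
print(summit_region(line), summit_midpoint(line))
```

Writing the parsed region of every peak as a three-column, tab-separated line produces a bed file that can be passed to deepTools, which can center each region on its midpoint.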

Figure 7. Heatmap of histone enrichment in peaks.

VIII. Differential analysis

27
DESeq2

DESeq2 estimates variance-mean dependence in count data from high-throughput sequencing assays and tests for differential expression based on a model using the negative binomial distribution.
28
Create the peak × sample matrix.

Usually, the differential tests compare two or more conditions of the same histone modification. In this tutorial, limited by the demonstration data, we will illustrate the differential detection by comparing two replicates of H3K27me3 and two replicates of H3K4me3. We will use DESeq2 (complete tutorial) as an illustration.

Create a master peak list merging all the peaks called for each sample.

Get the fragment counts for each peak in the master peak list.
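These two steps can be sketched in plain Python. This is a minimal illustration that assumes peaks and fragments are (chromosome, start, end) tuples in half-open BED coordinates, that overlapping or abutting peaks are merged into one master peak, and that a fragment is counted for every master peak it overlaps. The function names are hypothetical, and the pairwise counting is written for clarity rather than speed.

```python
def merge_peaks(peak_lists):
    """Merge overlapping or abutting peaks from all samples into a master list."""
    all_peaks = sorted(peak for peaks in peak_lists for peak in peaks)
    merged = []
    for chrom, start, end in all_peaks:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            merged[-1][2] = max(merged[-1][2], end)   # extend the current peak
        else:
            merged.append([chrom, start, end])        # start a new master peak
    return [tuple(peak) for peak in merged]


def count_matrix(master_peaks, fragments_per_sample):
    """Return one row per master peak and one column per sample."""
    return [
        [
            sum(1 for c, s, e in fragments if c == chrom and s < end and start < e)
            for fragments in fragments_per_sample
        ]
        for chrom, start, end in master_peaks
    ]
```

The resulting matrix holds raw, un-normalized fragment counts.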

29
Sequencing depth normalization and differential enriched peaks detection

- DESeq2 requires that the input matrix contain un-normalized counts or estimated counts of sequencing reads.

- The DESeq2 model internally corrects for library size.

- countMatDiff summarizes the differential analysis results:
- First 4 columns: raw read counts after filtering out the peak regions with low counts.
- Next 4 columns: normalized read counts, with library size differences removed.
- Remaining columns: differential detection results.

30
ChIPseqSpikeInFree for normalizing data without spike-in DNA [Optional]

ChIPseqSpikeInFree is a ChIP-seq normalization method that determines scaling factors for samples across various conditions and treatments. It does not rely on exogenous spike-in chromatin or peak detection to reveal global changes in histone modification occupancy. The installation details can be found on GitHub at https://github.com/stjude/ChIPseqSpikeInFree.

31
Other peak calling methods.

32
Other packages for differential analysis of binding sites

Limma is an R package for the analysis of gene expression microarray data, especially the use of linear models for analysing designed experiments and the assessment of differential expression. Limma provides the ability to analyse comparisons between many RNA targets simultaneously in arbitrarily complicated designed experiments. Empirical Bayesian methods are used to provide stable results even when the number of arrays is small. Limma can be extended to differential fragment enrichment analysis within peak regions. Notably, limma can handle both fixed-effect and random-effect models.

edgeR performs differential expression analysis of RNA-seq expression profiles with biological replication. It implements a range of statistical methodology based on the negative binomial distribution, including empirical Bayes estimation, exact tests, generalized linear models, and quasi-likelihood tests. As well as RNA-seq, it can be applied to the differential signal analysis of other types of genomic data that produce read counts, including ChIP-seq, ATAC-seq, Bisulfite-seq, SAGE and CAGE. edgeR can handle multifactor problems.

33
This workflow can be followed with your own data and will generate a standardized set of quality-control reports. However, many sequencing facilities do not perform 25x25 PE sequencing, and alternate parameters for trimming and mapping are provided here. Control datasets for non-specific antibody (IgG) profiling or ATAC-seq profiling of your material can also be used for optional analysis detailed here.

Stringent washing with 300 mM NaCl is critical to limit the affinity of Tn5 for exposed DNA. We describe here the need for controlling background Tn5 affinities and describe how our CUT&Tag protocol effectively suppresses this artifact for unambiguous mapping of chromatin epitopes. We present a protocol that can process either native or fixed nuclei and includes alternative methods for DNA isolation. To illustrate the method, we describe a typical experiment, including evaluation of the results using a new metric for peak-calling information. Further, we validate a single-tube format for CUT&Tag that requires no DNA isolation but instead uses tagmented material directly for library amplification. We document critical steps for the CUT&Tag protocol, informed by our experiences, helping users establish this method in their research.

Reference

34

Kaya-Okur HS, Wu SJ, Codomo CA, Pledger ES, Bryson TD, Henikoff JG, Ahmad K, Henikoff S: CUT&Tag for efficient epigenomic profiling of small samples and single cells. Nature Communications 2019 10:1930 (PMID:31036827).

Meers, M.P., Tenenbaum, D. & Henikoff, S. Peak calling by Sparse Enrichment Analysis for CUT&RUN chromatin profiling. Epigenetics & Chromatin 12, 42 (2019). https://doi.org/10.1186/s13072-019-0287-4

Cite this tutorial
Zheng Y et al (2020). Protocol.io