First, make sure data is downloaded and that requisite packages are installed.

Navigate to the data folder and run the shell script to get the fragments file (> 100Mb), which will be used for computing gene scores.

Next, run this code block below to verify that everything is installed; if not, run specific commands below to have a new install.

library(Seurat)
library(Signac)
library(chromVAR)
library(motifmatchr)
library(SummarizedExperiment)
library(Matrix)
library(data.table)
library(magrittr)
library(ggplot2)
library(viridis)
library(BSgenome.Hsapiens.UCSC.hg19)
library(EnsDb.Hsapiens.v75)

if(FALSE){
  
  # CRAN packages
  install.packages("Seurat")
  install.packages("magrittr")
  install.packages("ggplot2")
  install.packages("viridis")
  
  # Dev packages
  install.packages("devtools")
  devtools::install_github("timoast/signac")
  
  # Biocondcutor packages
  install.packages("BiocManager")
  BiocManager::install('chromVAR')
   BiocManager::install('motifmatchr')
  BiocManager::install( 'BSgenome.Hsapiens.UCSC.hg19')
  BiocManager::install('EnsDb.Hsapiens.v75')
}

10x QC report

While these installations and downloadds are running, check out the 10x QC report in ../data (also available here). Look over the report and try to understand what the metrics represent.

Questions to think about from the 10x QC report

What are cells with with very high fragment abundance but very low percentage of reads in peaks?
Why is there a ‘bump’ around position ~200 in the TSS enrichment plot?
What fraction of reads map to the mitochondrial genome? Is this lower or higher than we’d expect with ATAC-seq?
Are these libraries sequenced to saturation?
How are PCR duplicates determined in this assay?

For more on point 1, read here, specifically paragraph number 2 of “Cell-type-specific spatial organization of the accessible genome.”

File formats

These are all emitted from a standard 10x CellRanger-ATAC run as well as other droplet-based ATAC processing pipelines. I expect these to be the standard file formats moving forward.

Fragments file

A (perhaps) unfamiliar file format that comes as part of the default 10x scATAC-seq processing is a fragments.tsv file. You can read about it more here. Let’s take alook:

##      chr start   end            barcode nDup
##  1: chr1 10091 10344 TTCGCGTGTCCAAGAG-1    1
##  2: chr1 10096 10309 TACTAGGCAGAATGCG-1    1
##  3: chr1 10126 10351 TCAGTCCGTCCTTATT-1    2
##  4: chr1 10162 10543 ACTACCCAGGTACATA-1    2
##  5: chr1 10163 10407 AACTGGTAGAGGCCTA-1    1
##  6: chr1 10394 10625 CACGTTCAGCCTGTAT-1    1
##  7: chr1 10462 10500 TAAGTGCAGTCGCCTG-1    2
##  8: chr1 10512 10576 CCTCCCTAGGAATGGA-1    2
##  9: chr1 16221 16285 GGCGAAATCACAGGGA-1    2
## 10: chr1 17446 17527 CCTCCCTAGGAATGGA-1    1

A couple of useful tidbits: - The fragments from this file have been offset to account for Tn5 insertions and their asymmetry (i.e. +4 and -5), meaning the edges can be used for footprinting directly. - The last column indicates the number PCR duplications that are observed (with the same barcode). However, fragments that map to multiple barcodes are filtered from this file. - Reads mapping to blacklisted regions are retained and may need to be filtered in some downstream processes. - Chimeric reads or reads with a long insert are removed.

More often than not, I get all of my information from this file for downstream analyses. Though an analogous file does not exist for scRNA-seq, I absolutely recommend becoming familiar and comfortable utilizing and manipulating this file for most purposes (e.g. computing a peaks x cells matrix with a new peak set).

More familiar looking files

The contents of the filtered_peak_bc_matrix matches what is familiar to processing pipelines of scRNA-seq with the exception that the feature set (here: peaks; in RNA-seq, genes) is different between experiments, even with the same reference.

# import files
dir <- "../data/filtered_peak_bc_matrix/"
barcodes <- fread(paste0(dir, "barcodes.tsv"), header = FALSE)[[1]]
peaks <- fread(paste0(dir, "peaks.bed"), header = FALSE, col.names = c("chr", "start", "end"))
mat <- fread(paste0(dir, "matrix.mtx.gz"), skip = 3, header = FALSE)

# Barcodes look like we'd expect
head(barcodes)

## [1] "AAACGAATCGCATAAC-1" "AAACGAATCTGTGTGA-1" "AAACTCGAGAGGAACA-1"
## [4] "AAACTCGAGCCTGTAT-1" "AAACTCGAGCTGAGGT-1" "AAACTGCGTAGCGAGT-1"

# Sparse matrix encoding of data...
head(mat)

##       V1 V2 V3
## 1: 50025  1  1
## 2: 49994  1  2
## 3: 49978  1  4
## 4: 49976  1  2
## 5: 49971  1  2
## 6: 49963  1  4

# Peaks (featureset)... as called by CellRanger
head(peaks)

##     chr  start    end
## 1: chr1 565153 565494
## 2: chr1 569185 569619
## 3: chr1 713583 714646
## 4: chr1 752522 752925
## 5: chr1 762379 763306
## 6: chr1 804969 805600

# Now let's look at the distribution of peak widths...
qplot(peaks$end - peaks$start) + scale_x_log10() + labs(x = "Width of peak (log10)", y = "count")

Notice the wide distribution of the width of peaks– this can yield potential artifacts depending on the exact nature of analysis.

R data structures

Two of the standard objects for analysis of ATAC-seq are a RangedSummarizedExperiment (e.g. for use with chromVAR) and a Seurat object (for use with Signac and Seurat). Here are some custom functions that allow us to see how these data can be imported into R. The overall flow of these functions are quite similar.

import_10xATAC_SE <- function(dir, summary_csv){
  
  # import files
  barcodes <- fread(paste0(dir, "barcodes.tsv"), header = FALSE)[[1]]
  peaks <- fread(paste0(dir, "peaks.bed"), header = FALSE, col.names = c("chr", "start", "end"))
  peaks_gr <- makeGRangesFromDataFrame(peaks)
  matdt <- fread(paste0(dir, "matrix.mtx.gz"), skip = 3, header = FALSE)
  
  # Add an extra 0 to ensure the dimension is like we want in creating a sparse matrix
  mat <- Matrix::sparseMatrix(i = c(matdt[[1]],length(peaks)), 
                              j = c(matdt[[2]],length(barcodes)),
                              x = c(matdt[[3]], 0))
  colnames(mat) <- barcodes
  
  # Create colData using the summary csv
  summary_csv <- data.frame(fread("../data/atac_pbmc_1k_nextgem_singlecell.csv"))
  colData <- merge(data.frame(barcode = barcodes), summary_csv, all.y = FALSE)
  
  # Create a summarized experiment
  SE <- SummarizedExperiment(
    rowRanges = peaks_gr,
    colData = data.frame(barcode = barcodes),
    assays = list(counts = mat)
  )
}

import_10xATAC_seurat <- function(dir, summary_csv){
  
  # import files
  barcodes <- fread(paste0(dir, "barcodes.tsv"), header = FALSE)[[1]]
  peaks <- fread(paste0(dir, "peaks.bed"), header = FALSE, col.names = c("chr", "start", "end"))
  rownames <- paste0(peaks$chr, ":", as.character(peaks$start), "-", as.character(peaks$end))
  matdt <- fread(paste0(dir, "matrix.mtx.gz"), skip = 3, header = FALSE)
  
  # Add an extra 0 to ensure the dimension is like we want in creating a sparse matrix
  mat <- Matrix::sparseMatrix(i = c(matdt[[1]],length(peaks)), 
                              j = c(matdt[[2]],length(barcodes)),
                              x = c(matdt[[3]], 0))
  rownames(mat) <- rownames
  colnames(mat) <- barcodes
  
  # Create the object
  summary_csv <- read.csv("../data/atac_pbmc_1k_nextgem_singlecell.csv", row.names = 1, header = TRUE)
  pbmc2 <- CreateSeuratObject(
    counts = mat,
    assay = 'peaks',
    project = 'ATAC',
    min.cells = 0,
    meta.data = summary_csv
  )
}

seurat_atac <- import_10xATAC_seurat("../data/filtered_peak_bc_matrix/", "../data/atac_pbmc_1k_nextgem_singlecell.csv")
se_atac <- import_10xATAC_SE("../data/filtered_peak_bc_matrix/", "../data/atac_pbmc_1k_nextgem_singlecell.csv")

Comparison of the objects

These objects are quite similar representation of the data. However, some differencecs do exist. Notably, the rowspace of the Summarized Experiment object is a GenomicRanges object, which makes accessing sequences (e.g. motifs; GC content) extremely easy. The Seurat object is a more comprehensive object that provides a hierarchical representation of all elements of the scATAC-seq analysis.

dim(se_atac)

## [1] 50025  1195

dim(seurat_atac)

## [1] 50025  1195

Filtering cells

There are a variety of metrics that one can use to filter cells for downstream analysis. Most commonly, these are used: 1) the total number of high-quality fragments captured for the cell (in this case, the passed_filter number reported by CellRanger); 2) the proportion of fragments that map to an expected accessible chromatin region. Most conveniently, this is usually the fraction of reads in peaks (FRIP); however, other metrics such as proportion of reads overlapping annotated transcription start sites (TSSs) can provide a more consistent interpretation of results from experiment to experiment (but we can also observe that these metrics are highly correlated).

seurat_atac$pct_reads_in_peaks <- seurat_atac$peak_region_fragments / seurat_atac$passed_filters * 100
seurat_atac$pct_reads_in_TSS <- seurat_atac$TSS_fragments / seurat_atac$passed_filters * 100

qplot(log10(seurat_atac$passed_filters),seurat_atac$pct_reads_in_peaks) +
  labs(x = "log10 # of fragments", y = "% reads in peaks") +
  geom_hline(yintercept = 60, color = "firebrick") +
  geom_vline(xintercept = log10(5000), color = "firebrick")

qplot(seurat_atac$pct_reads_in_TSS,seurat_atac$pct_reads_in_peaks) +
  labs(x = "% reads in TSS", y = "% reads in peaks") +
  geom_abline(intercept = 0, slope = 1, linetype = 2, color = "firebrick")

# Filter object
pbmc <- subset(seurat_atac, subset =pct_reads_in_peaks > 60 & passed_filters > 5000)
dim(pbmc)

## [1] 50025   895

Questions to think about

How do we interpret cells below the dotten red line?
As an approximation, how many reads are in enhancers, promoters, and neither?

Dimensionality reduction and clustering with Signac

Utilizing the Signac object, we can call nice wrapper functions to go from our high-dimensional chromatin accessibility data to a two-dimensional cell embedding with possible cell types identified using graph-based clustering.

The TF-IDF (term frequency-inverse document frequency) functionality provides a normalization of the data. While in theory, TF-IDF has a singular interpretation, but in scATAC-seq analyses, there’s been a few modifications of it that have lead to some confusion– see this useful blogpost. The Signac authors provide four such implementations of TF-IDF; these yield roughly the same thing.

# Normalization
# there winds up being very little differences in the four methods from my evaluation
pbmc <- RunTFIDF(pbmc, method = 1)

After completing this normalization, we can perform subsequent steps like finding variable features, linear dimension reduction via SVD (similar to PCA), graph-based clustering, and two-dimensional manifold embedding.

Note: This code chunk takes about 2 minutes on my machine.

# Filter for variable feature
pbmc <- FindTopFeatures(pbmc, min.cutoff = 'q50')

# Run linear dimension reduction (very similar to PCA)
pbmc <- RunSVD(
  object = pbmc,
  assay = 'peaks',
  reduction.key = 'LSI_',
  reduction.name = 'lsi'
)

# Commands for a 2D embedding and graph-based clustering
pbmc <- RunUMAP(object = pbmc, reduction = 'lsi', dims = 1:30)
pbmc <- FindNeighbors(object = pbmc, reduction = 'lsi', dims = 1:30)
pbmc <- FindClusters(object = pbmc, verbose = FALSE)
DimPlot(object = pbmc, label = TRUE) + NoLegend()

Here, we see that this approach nicely picks out ~8-9 clusters using a standard workflow.

Compute gene scores using Signac

There have been several proposed methods for computing gene scores from scATAC-seq data. The basic idea is that we want to count accessibel chromatin fragments proximal to annotated gene bodies. These fragments may or may not be weighted. When they are, typical weights include correlation to the promoter of the target gene or the distance from the center of the fragment to the TSS. The most sophisticaed is Cicero, which utilizes a penalized correlation metric between enhancers and genes to compute these gene scores. What turns out to be the most effective (in my opinion and has been shown in some supplemental figures) is a basic idea introducted in Seurat/Signac to just count all fragments overlapping the gene body extended by 2kb (to include the promoter). This is computationally very efficient to compute. Additionally, this doesn’t rely on peaks for inference, which is very advantageous for merging datasets.

In the code chunk below, we use Signac to compute gene scores. The input requires the fragments file, discussed above.

Note: This code chunk takes about 3-4 minutes on my machine.

gene.coords <- genes(EnsDb.Hsapiens.v75, filter = ~ gene_biotype == "protein_coding")
seqlevelsStyle(gene.coords) <- 'UCSC'
genebody.coords <- keepStandardChromosomes(gene.coords, pruning.mode = 'coarse')
genebodyandpromoter.coords <- Extend(x = gene.coords, upstream = 2000, downstream = 0)

# create a gene by cell matrix
gene.activities <- FeatureMatrix(
  fragments = "../data/atac_pbmc_1k_nextgem_fragments.tsv.gz",
  features = genebodyandpromoter.coords,
  cells = colnames(pbmc),
  chunk = 5
)

# convert rownames from chromsomal coordinates into gene names
gene.key <- genebodyandpromoter.coords$gene_name
names(gene.key) <- GRangesToString(grange = genebodyandpromoter.coords)
rownames(gene.activities) <- gene.key[rownames(gene.activities)]

# add the gene activity matrix to the Seurat object as a new assay, and normalize it
pbmc[['RNA']] <- CreateAssayObject(counts = gene.activities)
pbmc <- NormalizeData(
  object = pbmc,
  assay = 'RNA',
  normalization.method = 'LogNormalize',
  scale.factor = median(pbmc$nCount_RNA)
)

By saving the gene score (pseudo-expression) counts in the RNA slot of the Seurat object, this structure is ready for integration with scRNA-seq data (see vignette URL at the end of the tutorial).

Dimensionality reduction with chromVAR

In contrast to the LSI-based dimension reduction used in Signac, we can compute meta-features that have an interpretable biological meaning via the chromVAR algorithm. More explicitly, chromVAR utilizes pre-defined transcription factor motifs to determine whether we observe more (positive z-score) or less (negative z-score) accessibility per cell than we’d expect under a null model. This approach is very useful for identifying master regulators of specific cell types and has been used in the past to inform dimension reduction and separation of cell types.

The series of functions used in chromVAR are to 1) annotate peaks with GC content as part of the background model; 2) identify TF motifs in the peak set; 3) idnetify variable motifs above some cutoff; 4) perform a final dimension reduction into a two-dimensional space (via t-SNE). These steps are shown below:

Note: this code chunk takes 2-3 minutes on my machine

# Subset down to similar cells and filter peaks for chromVAR
se_filtered <- filterPeaks(se_atac[,colnames(pbmc)], non_overlapping = TRUE)

# Import the motifs from JASPAR 2016 for humans (default)
motifs <- getJasparMotifs()

# Add the GC content of each peak to the summarized experiment object
se_filtered <- addGCBias(se_filtered, 
                            genome = BSgenome.Hsapiens.UCSC.hg19)

# Assemble
motif_ix <- matchMotifs(motifs, se_filtered, 
                        genome = BSgenome.Hsapiens.UCSC.hg19)
dev <- computeDeviations(object = se_filtered, annotations = motif_ix)

# Compute variability of
variability <- computeVariability(dev)
plotVariability(variability, use_plotly = TRUE)

# Perform a 2D embedding (note: UMAP is not internall supported in chromVAR)
tsne_results <- deviationsTsne(dev, threshold = 1.5)

dim(se_filtered)

## [1] 49668   895

dim(motif_ix)

## [1] 49668   386

dim(dev)

## [1] 386 895

Questions to think about from chromVAR execution

What do the dimensions of these objects at the end of the code chunk represent?
The BSgenome.Hsapiens.UCSC.hg19 genome object is passed to both addGCBias and matchMotifs. What purpose is the object serving in each function?
What would happen if we supplied the wrong BSgenome object?
What is the most variable TF motif in this sample? Does this make biological sense?

Compare two with marker gene annotations for NK cells

To compare the efficacy of the two approaches (chromVAR, Signac), we will visualize the reduced dimensions (tSNE, UMAP) for the same population of cells (enforced above) and annotate cells with known marker genes for different cell types of PBMCs. Here, we will examine NCR1, a marker gene for Natural Killer cells.

chromvar_df <- data.frame(
  tsne_results,
   t(pbmc@assays[["RNA"]]@counts)
)

signac_df <- data.frame(
  pbmc@reductions$umap@cell.embeddings, 
  t(pbmc@assays[["RNA"]]@counts)
)

ggplot(chromvar_df, aes(x= X1, y = X2, color = log2(NCR1 + 1))) +
  geom_point() + scale_color_viridis() +
  labs(x = "chromVAR tSNE1", y = "chromVAR tSNE2")

ggplot(signac_df, aes(x= UMAP_1, y = UMAP_2, color =  log2(NCR1 + 1))) +
  geom_point() + scale_color_viridis() +
  labs(x = "Signac UMAP1", y = "Signac UMAP2")

Note: we can also look for other PBMC marker genes with these data frames (e.g. PAX5)

ggplot(chromvar_df, aes(x= X1, y = X2, color = log2(PAX5 + 1))) +
  geom_point() + scale_color_viridis() +
  labs(x = "chromVAR tSNE1", y = "chromVAR tSNE2")

ggplot(signac_df, aes(x= UMAP_1, y = UMAP_2, color =  log2(PAX5 + 1))) +
  geom_point() + scale_color_viridis() +
  labs(x = "Signac UMAP1", y = "Signac UMAP2")

From this, note the following:

Gene scores are an independent annotation function (i.e. it’s not circular that the color co-localizes with specific parts of a cluster).
chromVAR successfully separates the major distinct celltypes (B-cells, T-cells, and Myeloid cells) in peripheral blood. However, it has limited resolution to identify other populations that require a more refined understanding (T-cells).
The Signac workflow (specifically, using LSI) recovers the NK T-cell cluster more distinctly.

Questions to consider

What are the feature sets for each workflow? Does either one indicate intrinsically
Under what circumstances could chromVAR give ‘cleaner’ signal than the LSI workflow?
Why do we log-scale the gene scores for visualization?
What is the specificity of the features used in gene scores compared to transcripts in scRNA-seq? Are there scenarios where one may be preferred over the other?

Summary

Key points to remember and review from this exercise:

Importing data into R and applicable data structures for scATAC-seq
How to interpret QC parameters from the 10x scATAC report
Use of the fragments file in analyses (e.g. gene scoring) and other potential applications
Parameters for filtering cells
How to perform dimensionality reduction using different packages (Signac / chromVAR) and what drives the dimension reduction in each technique

Additional resources

Integration and label transfering with RNA-seq / Signac. This also discusses how to generate accessible chromatin tracks by simply extending the workflow here.
Cicero provides functionality for pseudotime, enhancer-gene connections, and much more.
Batch correction and integration / Signac
Andrew Hill has a very useful blogpost that constrasts possible versions of LSI.
Dimension reduction and analyses with CisTopics. The tool works well but is relatively slow and hard to use with larger datasets.

Above, I’ve ranked these roughly in terms of how useful I could envision them being used for a new user.

Other useful tools

SnapATAC provides a computationally efficient processing of data and many useful functionalities.
EpiScanpy is a new tool building off of the popular python tool Scanpy for single-cell epigenomics data analysis.
HINT software for TF footprinting analyses. (currently undemonstrated with single-cell data)
Autoencoder framework for scATAC-seq

Acknowledgements

Many parts of this exercise relied heavily on functions and vignettes created by Tim Stuart, the creator of the Signac package and Alicia Schep, the creator of the chromVAR package.

Contact: email Caleb, instructor

Analysis single-cell ATAC-seq data

CSHL Advanced Sequencing Technologies

2019-11-11