The goal of these analyses are to identify enriched transcription factors / histone mods within a subset of genomic loci identified through whatever means (e.g. ATAC-Seq, ChIP-Seq). For example, consider a set of loci that were identified through 3 different ChIP-enrichments–
A.bed
B.bed
C.bed
A question of interest may be “what epigenomic/transcription factor features are enriched in A
relative to the others?”. The key here is that we must define a “universe” of genomic loci / peaks (in this example– the union of A
, B
, and C
) and then interrogate a subset (e.g. A
) for enrichment with various transcription factor binding loci and aggregation of histone modifications.
Other examples of sensible analyses in this framework would include (but are certainly not limited to)– “what is enriched in my (subsetted) list of differential peaks relative to all peaks?” or “what is enriched in my (subsetted) list of peaks that overlap a GWAS variant relative to all peaks?” A brief description of how to run this framework is described below.
hg38
and mm10
. If your data is aligned to a different reference genome, you’ll need to either realign or do a liftover. I have an Rpackage that makes this easyThe analyses are really straight-forward thanks to some really nice tools. Namely, the Cistrome database was used as a collection of “all the ChIP-Seq tracks on GEO”. Next, for the enrichment analyses, we use LOLA, which provides a relatively simply yet elegant implementation of Fisher’s exact test (math shown here) to perform the enrichment. I’ve used some data caching and a couple of other organization schemes to make the analysis run nicely and quickly, but these tools / authors do all the heavily lifiting and should be recognized / cited as such.
You’ll need to load a relatively recent version of R
into your PATH
. I found the following module to seemingly work fine.
module load stats/R/3.3.1
Next, you’ll need to install the singular R
dependency package, which is LOLA
. You can install LOLA through Bioconductor, or simply copy/paste the following code into an interactive R
session. Note: on Orchestra, R packages are maintained through individual user library instances, so each user will have to install LOLA
individually.
## try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")
biocLite("LOLA")
Currently, the folder with everything pre-computed is sitting here for Orchestra–
/n/scratch2/cal52/bedEnrichment
but it may move.
With the dependencies installed and the pre-computed file structure wrangled, the analysis can be executed using the following framework. A simple R script showing enrichment for an example factor (BAF155) within a universe (all BAF peaks) is shown here–
library(LOLA)
background_path <- "input/BAF_universe.bedover"
test_path <- "input/EOL1_WT_BAF155_peaks.bedover"
cores <- 8
regionDB <- readRDS("../hg38_hist/Histone_human_Cistrome/hg38_histone_cistrome.rds")
background <- readBed(background_path)
res <- runLOLA(readBed(),background,regionDB,cores=cores)
writeCombinedEnrichment(res, outFolder= "EOL1_BAF155_Histone")
Key points of this analysis script:
LOLA
package which is hopefully installed by now..bed
(or related file) of the “universe” of peaks..bed
(or related file) of the “test” peaks..rds
files that I’ve pre-computed and cached. The LOLA
documentation describes what happens in this step. For each of the databases that I’ve set up, it took > 2 days to precompute everything, which is why the caching is so valuable..tsv
file.The location/file path of all the cached data for mouse/human enrichments can be found here and what’s what should be self-explanatory–
hg38_hist/Histone_human_Cistrome/hg38_histone_cistrome.rds
hg38_tf/TF_human_Cistrome/hg38_TF_cistrome.rds
mm10_hist/Histone_mouse_Cistrome/mm10_histone_cistrome.rds
mm10_tf/TF_mouse_Cistrome/mm10_TF_cistrome.rds
Note: there are regions.tar.gz
that contain the actual .bed/.narrowPeak
files that were at some point read in and processed before I made these cached versions in the rds
file.
The sample Rscript
that I used to run the analyses for the BAF peaks can be found here:
/n/scratch2/cal52/bedEnrichment/BAF_EOL1/allEnrichments.R
Finally, to run the enrichment analyses, submit the Rscript
to the mcore
queue using a command like the following (untested but should probably work)–
bsub -q mcore -n 8 Rscript allEnrichments.R
Obviously, there are impossibly many errors to anticipate, but here are some potential issues that I could imagine would cause an analysis to pause or be unsuccessful or unintelligible:
chr
appended to them