This vignette is a condensed version of the documentation pages on the Cicero website. Please check out the website for more details.

Introduction:

The main purpose of Cicero is to use single-cell chromatin accessibility data to predict regions of the genome that are more likely to be in physical proximity in the nucleus. This can be used to identify putative enhancer-promoter pairs, and to get a sense of the overall stucture of the cis-architecture of a genomic region.

Because of the sparsity of single-cell data, cells must be aggregated by similarity to allow robust correction for various technical factors in the data.

Ultimately, Cicero provides a “Cicero co-accessibility” score between -1 and 1 between each pair of accessible peaks within a user defined distance where a higher score indicates higher co-accessibility.

In addition, the Cicero package provides an extension toolkit for analyzing single-cell ATAC-seq experiments using the framework provided by the single-cell RNA-seq analysis software, Monocle. This vignette provides an overview of a single-cell ATAC-Seq analysis workflow with Cicero. For further information and more options, please see the manual pages for the Cicero R package, the Cicero website and our publications.

Cicero can help you perform two main types of analysis:

Installing Cicero

  1. Download the package from Bioconductor.
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
 BiocManager::install("cicero")

Or install the development version of the package from Github.

BiocManager::install(cole-trapnell-lab/cicero)
  1. Load the package into R session.
 library(cicero)

Constructing cis-regulatory networks

Running Cicero

The CellDataSet class

Cicero holds data in objects of the CellDataSet (CDS) class. The class is derived from the Bioconductor ExpressionSet class, which provides a common interface familiar to those who have analyzed microarray experiments with Bioconductor. Monocle provides detailed documentation about how to generate an input CDS here.

To modify the CDS object to hold chromatin accessibility rather than expression data, Cicero uses peaks as its feature data fData rather than genes or transcripts. Specifically, many Cicero functions require peak information in the form chr1_10390134_10391134. For example, an input fData table might look like this:

  site_name chromosome bp1 bp2
chr10_100002625_100002940 chr10_100002625_100002940 10 100002625 100002940
chr10_100006458_100007593 chr10_100006458_100007593 10 100006458 100007593
chr10_100011280_100011780 chr10_100011280_100011780 10 100011280 100011780
chr10_100013372_100013596 chr10_100013372_100013596 10 100013372 100013596
chr10_100015079_100015428 chr10_100015079_100015428 10 100015079 100015428

The Cicero package includes a small dataset called cicero_data as an example.

data(cicero_data)

For convenience, Cicero includes a function called make_atac_cds. This function takes as input a data.frame or a path to a file in a sparse matrix format. Specifically, this file should be a tab-delimited text file with three columns. The first column is the peak coordinates in the form “chr10_100013372_100013596”, the second column is the cell name, and the third column is an integer that represents the number of reads from that cell overlapping that peak. The file should not have a header line.

For example:

     
chr10_100002625_100002940 cell1 1
chr10_100006458_100007593 cell2 2
chr10_100006458_100007593 cell3 1
chr10_100013372_100013596 cell2 1
chr10_100015079_100015428 cell4 3

The output of make_atac_cds is a valid CDS object ready to be input into downstream Cicero functions.

input_cds <- make_atac_cds(cicero_data, binarize = TRUE)

Create a Cicero CDS

Because single-cell chromatin accessibility data is extremely sparse, accurate estimation of co-accessibility scores requires us to aggregate similar cells to create more dense count data. Cicero does this using a k-nearest-neighbors approach which creates overlapping sets of cells. Cicero constructs these sets based on a reduced dimension coordinate map of cell similarity, for example, from a tSNE or DDRTree map.

You can use any dimensionality reduction method to base your aggregated CDS on. We will show you how to create two versions, tSNE and DDRTree (below). Both of these dimensionality reduction methods are available from Monocle (and loaded by Cicero).

Once you have your reduced dimension coordinate map, you can use the function make_cicero_cds to create your aggregated CDS object. The input to make_cicero_cds is your input CDS object, and your reduced dimension coordinate map. The reduced dimension map reduced_coordinates should be in the form of a data.frame or a matrix where the row names match the cell IDs from the pData table of your CDS. The columns of reduced_coordinates should be the coordinates of the reduced dimension object, for example:

  ddrtree_coord1 ddrtree_coord2
cell1 -0.7084047 -0.7232994
cell2 -4.4767964 0.8237284
cell3 1.4870098 -0.4723493

Here is an example of both dimensionality reduction and creation of a Cicero CDS. Using Monocle as a guide, we first find tSNE coordinates for our input_cds:

set.seed(2017)
input_cds <- detectGenes(input_cds)
input_cds <- estimateSizeFactors(input_cds)
input_cds <- reduceDimension(input_cds, max_components = 2, num_dim=6,
                        reduction_method = 'tSNE', norm_method = "none")

For more information on the above code, see the Monocle website section on clustering cells.

Next, we access the tSNE coordinates from the input CDS object where they are stored by Monocle and run make_cicero_cds:

tsne_coords <- t(reducedDimA(input_cds))
row.names(tsne_coords) <- row.names(pData(input_cds))
cicero_cds <- make_cicero_cds(input_cds, reduced_coordinates = tsne_coords)
#> Overlap QC metrics:
#> Cells per bin: 50
#> Maximum shared cells bin-bin: 44
#> Mean shared cells bin-bin: 12.8105263157895
#> Median shared cells bin-bin: 3
#> Warning in make_cicero_cds(input_cds, reduced_coordinates = tsne_coords):
#> On average, more than 10% of cells are shared between paired bins.

Run Cicero

The main function of the Cicero package is to estimate the co-accessiblity of sites in the genome in order to predict cis-regulatory interactions. There are two ways to get this information:

  • run_cicero, get Cicero outputs with all defaults The function run_cicero will call each of the relevant pieces of Cicero code using default values, and calculating best-estimate parameters as it goes. For most users, this will be the best place to start.
  • Call functions separately, for more flexibility For users wanting more flexibility in the parameters that are called, and those that want access to intermediate information, Cicero allows you to call each of the component parts separately. More information about running function separately is available on the package manual pages and on the Cicero website.

The easiest way to get Cicero co-accessibility scores is to run run_cicero. To run run_cicero, you need a cicero CDS object (created above) and a genome coordinates file, which contains the lengths of each of the chromosomes in your organism. The human hg19 coordinates are included with the package and can be accessed with data(“human.hg19.genome”). Here is an example call, continuing with our example data:

data("human.hg19.genome")
sample_genome <- subset(human.hg19.genome, V1 == "chr18")
sample_genome$V2[1] <- 10000000
conns <- run_cicero(cicero_cds, sample_genome, sample_num = 2) # Takes a few minutes to run
#> [1] "Starting Cicero"
#> [1] "Calculating distance_parameter value"
#> [1] "Running models"
#> [1] "Assembling connections"
#> [1] "Done"
head(conns)
#>               Peak1               Peak2   coaccess
#> 2 chr18_10025_10225   chr18_10603_11103  0.8206045
#> 3 chr18_10025_10225   chr18_11604_13986 -0.4051520
#> 4 chr18_10025_10225   chr18_49557_50057 -0.3683652
#> 5 chr18_10025_10225   chr18_50240_50740 -0.3700978
#> 6 chr18_10025_10225 chr18_104385_104585  0.0000000
#> 7 chr18_10025_10225 chr18_111867_112367  0.0000000

Visualizing Cicero Connections

The Cicero package includes a general plotting function for visualizing co-accessibility called plot_connections. This function uses the Gviz framework for plotting genome browser-style plots. We have adapted a function from the Sushi R package for mapping connections. plot_connections has many options, some detailed in the Advanced Visualization section on the Cicero website, but to get a basic plot from your co-accessibility table is quite simple:

data(gene_annotation_sample)
plot_connections(conns, "chr18", 8575097, 8839855, 
                 gene_model = gene_annotation_sample, 
                 coaccess_cutoff = .25, 
                 connection_width = .5, 
                 collapseTranscripts = "longest" )

Comparing Cicero connections to other datasets

Often, it is useful to compare Cicero connections to other datasets with similar kinds of links. For example, you might want to compare the output of Cicero to ChIA-PET ligations. To do this, Cicero includes a function called compare_connections. This function takes as input two data frames of connection pairs, conns1 and conns2, and returns a logical vector of which connections from conns1 are found in conns2. The comparison in this function is conducted using the GenomicRanges package, and uses the max_gap argument from that package to allow slop in the comparisons.

For example, if we wanted to compare our Cicero predictions to a set of (made-up) ChIA-PET connections, we could run:

chia_conns <-  data.frame(Peak1 = c("chr18_10000_10200", "chr18_10000_10200", 
                                    "chr18_49500_49600"), 
                          Peak2 = c("chr18_10600_10700", "chr18_111700_111800", 
                                    "chr18_10600_10700"))

conns$in_chia <- compare_connections(conns, chia_conns)

You may find that this overlap is too strict when comparing completely distinct datasets. Looking carefully, the 2nd line of the ChIA-PET data matches fairly closely to the last line shown of conns. The difference is only ~67 base pairs, which could be a matter of peak-calling. This is where the max_gap parameter can be useful:

conns$in_chia_100 <- compare_connections(conns, chia_conns, maxgap=100)

head(conns)
#>               Peak1               Peak2   coaccess in_chia in_chia_100
#> 2 chr18_10025_10225   chr18_10603_11103  0.8206045    TRUE        TRUE
#> 3 chr18_10025_10225   chr18_11604_13986 -0.4051520   FALSE       FALSE
#> 4 chr18_10025_10225   chr18_49557_50057 -0.3683652   FALSE       FALSE
#> 5 chr18_10025_10225   chr18_50240_50740 -0.3700978   FALSE       FALSE
#> 6 chr18_10025_10225 chr18_104385_104585  0.0000000   FALSE       FALSE
#> 7 chr18_10025_10225 chr18_111867_112367  0.0000000   FALSE        TRUE

In addition, Cicero’s plotting function has a way to compare datasets visually. To do this, use the comparison_track argument. The comparison data frame must include a third columns beyond the first two peak columns called “coaccess”. This is how the plotting function determines the height of the plotted connections. This could be a quantitative measure, like the number of ligations in ChIA-PET, or simply a column of 1s. More info on plotting options in manual pages ?plot_connections and in the Advanced Visualization section of the Cicero website.

# Add a column of 1s called "coaccess"
chia_conns <-  data.frame(Peak1 = c("chr18_10000_10200", "chr18_10000_10200", 
                                    "chr18_49500_49600"), 
                          Peak2 = c("chr18_10600_10700", "chr18_111700_111800", 
                                    "chr18_10600_10700"),
                          coaccess = c(1, 1, 1))

plot_connections(conns, "chr18", 10000, 112367, 
                 gene_model = gene_annotation_sample, 
                 coaccess_cutoff = 0,
                 connection_width = .5,
                 comparison_track = chia_conns,
                 include_axis_track = FALSE,
                 collapseTranscripts = "longest")