Contents

1 Introduction

gmoviz logo

This vignette will guide you through the more advanced uses of gmoviz, such as the incremental apporach to generating plots and making finer modifications. It is highly recommended that you have read the basic overview of gmoviz before this vignette.

2 Incremental plotting steps

As well as high-level functions functions, gmoviz contains many lower-level functions that can be used to construct a plot track-by-track for more flexibility.

2.1 Dataset

This section will use the rBiocpkg("pasillaBamSubset") package for example data, so please ensure you have it installed before proceeding:

if (!require("pasillaBamSubset")) {
    if (!require("BiocManager"))
        install.packages("BiocManager")
    BiocManager::install("GenomicAlignments")
}
library(pasillaBamSubset)

2.2 Initialisation & Ideograms

The first step in creating a circular plot is to initialise it. This involves creating the ideogram (the rectangles that represent each sequence), which lays out the sectors for data to be plotted into. To do this, we need some ideogram data, in one of the following formats:

  • A GRanges, with one range for each sector you’d like to plot.
  • A data.frame, with three columns: chr (sector’s name), start and end.

For example, the following two ideogram data are equivalent:

ideogram_1 <- GRanges(seqnames = c("chrA", "chrB", "chrC"),
                 ranges = IRanges(start = rep(0, 3), end = rep(1000, 3)))
ideogram_2 <- data.frame(chr = c("chrA", "chrB", "chrC"), 
                    start = rep(0, 3),
                    end = rep(1000, 3))
print(ideogram_1)
#> GRanges object with 3 ranges and 0 metadata columns:
#>       seqnames    ranges strand
#>          <Rle> <IRanges>  <Rle>
#>   [1]     chrA    0-1000      *
#>   [2]     chrB    0-1000      *
#>   [3]     chrC    0-1000      *
#>   -------
#>   seqinfo: 3 sequences from an unspecified genome; no seqlengths
print(ideogram_2)
#>    chr start  end
#> 1 chrA     0 1000
#> 2 chrB     0 1000
#> 3 chrC     0 1000

Both of the higher level functions featureDiagram and insertionDiagram do this as their first step.

2.2.1 Reading in the ideogram data

Of course, typing this manually each time is troublesome. gmoviz provides the function getIdeogramData which creates a GRanges of the ideogram data from either a .bam file, single .fasta file or a folder containing many .fasta files.1 Note that reading in from a .bam file is significantly faster than from a .fasta file. This function can be used as follows:

## from a .bam file
fly_ideogram <- getIdeogramData(bam_file = pasillaBamSubset::untreated3_chr4())

## from a single .fasta file
fly_ideogram_chr4_only <- getIdeogramData(
  fasta_file = pasillaBamSubset::dm3_chr4())

But what if we wanted to read in just the chr3L? Luckily getIdeogramData has filters to select the specific sequences you want.

2.2.1.1 Filtering ideogram data

When reading in the ideogram data from file, there are often sequences in the .bam file or .fasta file folder that are not necessary for the plot. Thus, the getIdeogramData function provides three filters to allow you to only read in the sequences you want.2 These filters only work on the bam_file and fasta_folder input methods. Using a fasta_file means that filtering is not possible (although you can of course edit the ideogram GRanges after it is generated).

If we want only a single chromosome/sequence, we can supply it to wanted_chr:

getIdeogramData(bam_file = pasillaBamSubset::untreated3_chr4(),
                wanted_chr = "chr4")
#> GRanges object with 1 range and 0 metadata columns:
#>       seqnames    ranges strand
#>          <Rle> <IRanges>  <Rle>
#>   [1]     chr4 0-1351857      *
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Alternatively, if we want all chromosomes/sequences expect one, we can supply it to unwanted_chr:

getIdeogramData(bam_file = pasillaBamSubset::untreated3_chr4(),
                unwanted_chr = "chrM")
#> GRanges object with 7 ranges and 0 metadata columns:
#>       seqnames     ranges strand
#>          <Rle>  <IRanges>  <Rle>
#>   [1]    chr2L 0-23011544      *
#>   [2]    chr2R 0-21146708      *
#>   [3]    chr3L 0-24543557      *
#>   [4]    chr3R 0-27905053      *
#>   [5]     chr4  0-1351857      *
#>   [6]     chrX 0-22422827      *
#>   [7]  chrYHet   0-347038      *
#>   -------
#>   seqinfo: 7 sequences from an unspecified genome; no seqlengths

Finally, you can supply any regex pattern to just_pattern to create your own custom filter:

getIdeogramData(bam_file = pasillaBamSubset::untreated3_chr4(),
                just_pattern = "R$")
#> GRanges object with 2 ranges and 0 metadata columns:
#>       seqnames     ranges strand
#>          <Rle>  <IRanges>  <Rle>
#>   [1]    chr2R 0-21146708      *
#>   [2]    chr3R 0-27905053      *
#>   -------
#>   seqinfo: 2 sequences from an unspecified genome; no seqlengths

Of course, for these filters to work the spelling of the filter must exactly match the spelling of the .fasta file names or the sequences in the .bam file.

2.2.2 Initialising the graph

Now that we have the ideogram data, we can initialise the graph. For this example, we will just focus on chromosome 4.

gmovizInitialise(fly_ideogram_chr4_only, track_height = 0.15)

We can see that a rectangle has been plotted and labelled to indicate chr4. Changing a few visual settings, we can create a better looking ideogram:

gmovizInitialise(fly_ideogram_chr4_only, 
                 space_between_sectors = 25, # bigger space between start & end 
                 start_degree = 78, # rotate the circle
                 sector_label_size = 1, # bigger label
                 track_height = 0.15, # thicker rectangle
                 xaxis_spacing = 30) # label every 30 degrees on the x axis

However, these small tweaks are not the only way we can enhance the appearance of our plot. gmovizInitialise can also display coverage data and labels, as well as supporting zooming and alteration of sector widths.

2.2.2.1 ‘Coverage rectangles’

As demonstrated with the insertionDiagram and featureDiagram functions, we can supply some coverage_data to enhance the ideogram and change the regular rectangles into line graphs which display the coverage (‘coverage rectangles’). This then allows the easy identification of deletions, duplications and other events which alter the coverage.

2.2.2.1.1 Reading in coverage data

To do this, we must first read in the coverage information from the .bam file. This can be done with the getCoverage function:

chr4_coverage <- getCoverage(
  regions_of_interest = "chr4", 
  bam_file = pasillaBamSubset::untreated3_chr4(),
  window_size = 350, smoothing_window_size = 400)

Here, we get the smoothed and windowed coverage for chr4.3 See below the section on smoothing and windowing for the effect of each of these arguments As we wanted the coverage for the entire chr4, we could simply make regions_of_interest = "chr4". However, we could also have supplied a GRanges describing that area instead. Whichever input is used, it is really important that the sequence names match exactly. For example, the following will t hrow an error, because there is no sequence named “4” or “Chr4” in the .bam file:

getCoverage(regions_of_interest = "4", 
            bam_file = pasillaBamSubset::untreated3_chr4(),
            window_size = 300, smoothing_window_size = 400)
#> Error in getCoverage(regions_of_interest = "4", bam_file = pasillaBamSubset::untreated3_chr4(), : Make sure all of the chromsomes in regions_of_interest are in
#>                 the bam file and spelled exactly the same as in the bam
getCoverage(regions_of_interest = "Chr4", 
            bam_file = pasillaBamSubset::untreated3_chr4(),
            window_size = 300, smoothing_window_size = 400)
#> Error in getCoverage(regions_of_interest = "Chr4", bam_file = pasillaBamSubset::untreated3_chr4(), : Make sure all of the chromsomes in regions_of_interest are in
#>                 the bam file and spelled exactly the same as in the bam
2.2.2.1.2 Plotting coverage

Now that we have the coverage data, we can plot the ideogram again using this information. To draw a ‘coverage rectangle’ we need to firstly specifiy the coverage_data to be used (as either a GRanges or a data frame) and then also supply to coverage_rectangle a vector of the sector names to plot the coverage data for.4 This means that you can have the coverage of multiple sequences/regions in the same GRanges but choose to plot only some of them.

gmovizInitialise(ideogram_data = fly_ideogram_chr4_only, 
                 coverage_rectangle = "chr4", 
                 coverage_data = chr4_coverage,
                 xaxis_spacing = 30) 

As you can see, the chr4 ideogram rectangle is replaced with a line graph showing the coverage over the entire chromosome. The coloured area represents the coverage, allowing easy identification of high and low coverage areas.

2.2.2.1.3 Smoothing and windowing coverage data

When reading in the coverage data, there are two additional parameters window_size and smoothing_window_size that modify the values.

  • window_size controls the window size over which coverage is calculated (where a window size of 1 is per base coverage. A larger window size will reduce the time taken to read in, smooth and plot the coverage. It will also remove some of the variation in the coverage, although this is not its primary aim. If you have more than 10-15,000 points, it is highly recommended to use a larger window size, as this will take a long time to plot.

  • smoothing_window_size controls the window used for moving average smoothing, as carried out by the pracma package. It does not reduce the number of points and so offers no speed improvement (in fact, it increases the time taken to read in the coverage data). It does, however, reduce the variation to produce a smoother, more attractive plot.

For example, try running the following:

# default window size (per base coverage)
system.time({getCoverage(regions_of_interest = "chr4", 
                         bam_file = pasillaBamSubset::untreated3_chr4())})

# window size 100
system.time({getCoverage(regions_of_interest = "chr4", 
                         bam_file = pasillaBamSubset::untreated3_chr4(),
                         window_size = 100)})

# window size 500
system.time({getCoverage(regions_of_interest = "chr4", 
                         bam_file = pasillaBamSubset::untreated3_chr4(),
                         window_size = 500)})

Notice how going from the default window size of 1 (per base coverage) to a relatively modest window size of 100 dramatically reduces the time needed to read in the coverage data.

In terms of the appearance of the plot: (note: for speed, we will plot only a subset of the chromosome: from 70000-72000bp)

# without smoothing
chr4_region <- GRanges("chr4", IRanges(70000, 72000))
chr4_region_coverage <- getCoverage(regions_of_interest = chr4_region,
                          bam_file = pasillaBamSubset::untreated3_chr4())
gmovizInitialise(ideogram_data = chr4_region, coverage_rectangle = "chr4", 
                 coverage_data = chr4_region_coverage, custom_ylim = c(0,4))

# with moderate smoothing 
chr4_region_coverage <- getCoverage(regions_of_interest = chr4_region,
                          bam_file = pasillaBamSubset::untreated3_chr4(),
                          smoothing_window_size = 10)
gmovizInitialise(ideogram_data = chr4_region, coverage_rectangle = "chr4", 
                 coverage_data = chr4_region_coverage, custom_ylim = c(0,4))

# with strong smoothing
chr4_region_coverage <- getCoverage(regions_of_interest = chr4_region,
                          bam_file = pasillaBamSubset::untreated3_chr4(),
                          smoothing_window_size = 75)
gmovizInitialise(ideogram_data = chr4_region, coverage_rectangle = "chr4", 
                 coverage_data = chr4_region_coverage, custom_ylim = c(0,4))