Contents

1 Introduction

Welcome to the introduction of data management with ORFik experiment. This vignette will walk you through how to work with large amounts of sequencing data effectively in ORFik. ORFik is an R package containing various functions for analysis of RiboSeq, RNASeq and CageSeq data, we advice you to read ORFikOverview vignette, before starting this one.

1.1 Motivation

NGS libraries are becoming more and more numerous. As a bioinformatician you often need to use many data-sets, like RNA-seq or ribo-seq together, to make some plots or statistics. A lot of things can go wrong when you scale up from just 1 data-set to many.

Another problem is also that annotations like gff and fasta files combined with the NGS data, must be separately loaded. Making it possible to use wrong annotation for the NGS data, or wrong chromosome naming as chr1 vs 1 etc.

1.2 What is an ORFik experiment?

It is an object to massively simplify / error correcting your code, by having a table of all libraries and annotation of an experiment. That contains filepaths and info for each library / annotation files in the experiment. It also tries to guess grouping / types / pairs (paired end bam files etc.) by the file names. It is also a safety in that it verifies your experiments contain no duplicate files, or empty or non-accessible files. Making it almost impossible to load the wrong data. In addition it checks chromosome naming of libraries and annotation, making sure you are not mixing chr1 vs 1 as name for chromosome 1 etc.

The main reason to represent your NGS data as an ORFik experiment will now be shown.

1.3 Example of creating an ORFik experiment

First load ORFik

library(ORFik)

Let’s say we have a human experiment, containing annotation files (.gtf and .fasta genome) + Next generation sequencing libraries (NGS-data); RNA-seq, ribo-seq and CAGE.

# Read from (create.experiment() template)
# 1. Pick directory (normally a folder with bam / bed / wig files)
dir <- system.file("extdata", "", package = "ORFik")
list.files(dir)
##  [1] "QC_STATS"               "annotations.gtf"        "cage-seq-heart.bed.bgz"
##  [4] "features.rdata"         "genome.fasta"           "genome.fasta.fai"      
##  [7] "pshifted"               "ribo-seq-heart.bed.bgz" "ribo-seq.bam"          
## [10] "ribo-seq.bam.bai"       "rna-seq-heart.bed.bgz"
# 2. Pick an experiment name
exper <- "ORFik"
# 3. Pick .gff/.gtf and fasta location
txdb <- system.file("extdata", "annotations.gtf", package = "ORFik")
fasta <- system.file("extdata", "genome.fasta", package = "ORFik")
template <- create.experiment(dir = dir,   # dir is the NGS files
                              exper,       # Experiment name
                              txdb = txdb, # gtf / gff / gff.db annotation
                              fa = fasta,  # Fasta genome
                              organism = "Homo sapiens", # Scientific naming
                              saveDir = NULL, # If not NULL, saves experiment directly
                              viewTemplate = FALSE)
data.frame(template)
##        X1                                                              X2  X3
## 1    name                                                           ORFik    
## 2     gff /tmp/RtmpnKTeMP/Rinst172d7786d3cd/ORFik/extdata/annotations.gtf    
## 3   fasta    /tmp/RtmpnKTeMP/Rinst172d7786d3cd/ORFik/extdata/genome.fasta    
## 4 libtype                                                           stage rep
## 5    CAGE                                                           heart    
## 6     RFP                                                           heart    
## 7     RFP                                                                    
## 8     RNA                                                           heart    
##          X4       X5
## 1                   
## 2           organism
## 3                   
## 4 condition fraction
## 5                   
## 6                   
## 7                   
## 8                   
##                                                                       X6
## 1                                                                       
## 2                                                           Homo sapiens
## 3                                                                       
## 4                                                               filepath
## 5 /tmp/RtmpnKTeMP/Rinst172d7786d3cd/ORFik/extdata/cage-seq-heart.bed.bgz
## 6 /tmp/RtmpnKTeMP/Rinst172d7786d3cd/ORFik/extdata/ribo-seq-heart.bed.bgz
## 7           /tmp/RtmpnKTeMP/Rinst172d7786d3cd/ORFik/extdata/ribo-seq.bam
## 8  /tmp/RtmpnKTeMP/Rinst172d7786d3cd/ORFik/extdata/rna-seq-heart.bed.bgz

You see from the template, it excludes files with .bai or .fai, .rdata etc, and only using data of NGS libraries, defined by argument (type).

You can also see it tries to guess library types, stages, replicates, condition etc. It will also try to auto-detect paired end bam files. To fix the things it did not find, you either save the file and modify in Excel / Libre office, or do it directly in R.

Let’s update the template to have correct tissue-fraction in one of the samples.

template$X5[6] <- "heart_valve" # <- fix non unique row (tissue fraction is heart valve)
# read experiment from template
df <- read.experiment(template)

To save it, do:

save.experiment(df, file = "path/to/save/experiment.csv")

You can then load the experiment whenever you need it.

2 The experiment object

To see the object, just show it like this:

df
## experiment: ORFik with 3 library types and 4 runs 
##    libtype stage    fraction
## 1:    CAGE heart            
## 2:     RFP heart heart_valve
## 3:     RFP                  
## 4:     RNA heart

You see here that file paths are hidden, you can acces them like this:

If you have varying version of libraries, like p-shifted, bam, simplified wig files, you can get filepaths to different version with this function.

filepath(df, type = "default")
## [1] "/tmp/RtmpnKTeMP/Rinst172d7786d3cd/ORFik/extdata/cage-seq-heart.bed.bgz"
## [2] "/tmp/RtmpnKTeMP/Rinst172d7786d3cd/ORFik/extdata/ribo-seq-heart.bed.bgz"
## [3] "/tmp/RtmpnKTeMP/Rinst172d7786d3cd/ORFik/extdata/ribo-seq.bam"          
## [4] "/tmp/RtmpnKTeMP/Rinst172d7786d3cd/ORFik/extdata/rna-seq-heart.bed.bgz"

2.1 Loading all data in experiment

# First load experiment if not present
# We use our already loaded experiment: (df) here

# Load transcript annotation
txdb <- loadTxdb(df) # transcript annotation
## Import genomic features from the file as a GRanges object ... OK
## Prepare the 'metadata' data frame ... OK
## Make the TxDb object ... OK
# And now NGS data
outputLibs(df, chrStyle = seqlevelsStyle(txdb)) # Use txdb as seqlevelsStyle reference
## Outputting libraries from: ORFik

By default all libraries are loaded into .GlobalEnv (global environment) with names decided by columns in experiment, to see what the names will be, do:

bamVarName(df) #This will be the names:
## [1] "CAGE_heart"             "RFP_heart_fheart_valve" "RFP"                   
## [4] "RNA_heart"

If you have multiple experiments, it might be a chance of non-unique naming, 2 experiments might have a library called cage. To be sure names are unique, add the experiment in the variable name:

df@expInVarName <- TRUE
bamVarName(df) #This will be the names:
## [1] "ORFik_CAGE_heart"             "ORFik_RFP_heart_fheart_valve"
## [3] "ORFik_RFP"                    "ORFik_RNA_heart"

You see here that the experiment name, “ORFik” is in the variable name If you are only working on one experiment, you do not need to include the name, since there is no possibility of duplicate naming (the experiment class validates all names are unique).

Since we want NGS data names without “ORFik”, let’s remove the loaded libraries and load them again.

df@expInVarName <- FALSE
remove.experiments(df)
## Removed loaded libraries from experiment:ORFik
outputLibs(df, chrStyle = seqlevelsStyle(txdb)) 
## Outputting libraries from: ORFik

2.2 Loading transcript regions

Let’s say we want to load all leaders, cds and 3’ UTRs that are longer than 30. With ORFik experiment this is easy:

txNames <- filterTranscripts(txdb, minFiveUTR = 30,minCDS = 30, minThreeUTR = 30)
loadRegions(txdb, parts = c("leaders", "cds", "trailers"), names.keep = txNames)

The regions are now loaded into .GlobalEnv, only keeping transcripts from txNames.

2.3 Plotting with ORFik experiments

Lets make a plot with coverage over mrna in just ribo-seq

transcriptWindow(leaders, cds, trailers, df[3,])
## RFP
## [[1]]

## 
## [[2]]

3 P-site shifting experiment

If your experiment consists of Ribo-seq, you want to do p-site shifting.

shiftFootprintsByExperiment(df[df$libtype == "RFP",])

P-shifted ribo-seq will automaticly be stored as .wig (wiggle files for IGV and other genome browsers) and .ofst (ORFik serialized for R) files in a ./pshifted folder, relative to original libraries.

To validate p-shifting, use shiftPlots. Here is an example from Bazzini et al. 2014 I made.

df.baz <- read.experiment("zf_bazzini14_RFP")
shiftPlots(df.baz, title = "Ribo-seq, zebrafish, Bazzini et al. 2014")