1 Introduction

This vignette describes the use of the seqCAT package for authentication, characterisation and evaluation of two or more High Throughput Sequencing samples (HTS; RNA-seq or whole genome sequencing). The principle of the method is built upon previous work, where it was demonstrated that analysing the entirety of the variants found in HTS data provides unprecedented statistical power and great opportunities for functional evaluation of genetic similarities and differences between biological samples (Fasterius et al. 2017).

seqCAT work by creating Single Nucelotide Variant (SNV) profiles of every sample of interest, followed by comparisons between each set to find overall genetic similarity, in addition to detailed analyses of the differences. By analysing your data with this workflow you will not only be able to authenticate your samples to a high degree of confidence, but you will also be able to investigate what genes and transcripts are affected by SNVs differing between your samples, what biological effect they will have, and more. seqCAT’s workflow consists of three separate steps:

1.  Creation of SNV profiles
2.  Comparisons of SNV profiles
3.  Authentication, characterisation and evaluation of profile comparisons

Each step has its own section(s) below demonstrating how to perform the analyses. Input data should be in the form of VCF files, i.e output from variant callers such as the Genome Analysis ToolKit and annotated with software such as SnpEff.

1.1 Installation

The latest stable release of this package can be found on Bioconductor and installed using the biocLite function:

source("https://bioconductor.org/biocLite.R")
biocLite("seqCAT")

This will also install any missing packages requires for full functionality, should they not already exist in your system. If you haven’t installed Bioconductor, you can do so by simply calling biocLite() without specifying a package, and it will be installed for you. You can read more about this at Bioconductor’s installation page. You can find the development version of seqCAT on GitHub.

2 Creation of SNV profiles

The first step of the workflow is to create the SNV profile of each sample, which can then be compared to each other. In order to decrease the computation time for large comparison sets and to facilitate re-analyses with different parameters each SNV profile is saved on disc as a normal .txt file. While computation time is usually not an issue for simple binary comparisons (i.e. comparisons with only two samples), this can quickly become a concern for analyses where samples are compared to several others (A vs B, A vs C, …, and so on). The creation of a SNV profile includes filtering of low-confidence variants and removal of variants below a sequencing depth threshold (10 by default). Only records with the highest SNV impact (i.e. impact on protein function) for each variant is kept, as they are most likely to affect the biology of the cells. Creation of SNV profiles is also implemented in Python (section 2.2), which is faster than the standard implementation in R (section 2.1).

2.1 Create profiles with R

Throughout this vignette we will be using some example data, example.vcf.gz, which comes from the initial publication of the general process of this method (Fasterius et al. 2017). It is a simplified multi-sample VCF file on a subset of chromosome 12 (containing all variants up to position 25400000, in order to keep the file size low) for three different colorectal cancer cell lines: HCT116, HKE3 and RKO.

# Load the package
library("seqCAT")

# List the example VCF file
vcf <- system.file("extdata", "example.vcf.gz",
                   package = "seqCAT")

# Create two SNV profiles
create_profile(vcf, "HCT116", "hct116_profile.txt")
create_profile(vcf, "RKO", "rko_profile.txt", filter_depth = 15)

This creates SNV profiles for the two samples found in the example data (HCT116 and RKO) and saves them as hct116.profile.txt and rko_profile.txt in the current directory, respectively. The profile of the second sample was created with a non-standard filter for sequencing depth (15), which should only be done if you want a stricter criteria for your profile (such as when you’re only interested in higher-than-standard confidence variants).

2.2 Create profiles faster with Python

SNV profiles can also be created with Python, another scripting language, if you have installed it. You will also need to install the PyVCF module, in order for it to run. The Python version can create SNV profiles approximately five to ten times quicker than its R equivalent. This is not important for most users, but is nevertheless included for cases where extra speed is desired.

create_profile(vcf, "RKO", "RKO_profile.txt", python = TRUE)

2.3 Create COSMIC profiles

It is also possible to to compare your samples’ variants to some external source. Such a source is the Catalogue of somatic mutations in cancer, or COSMIC. COSMIC has over a thousand cell line-specific mutational profiles, and is thus a very useful resource if you are working with cell lines.

In order to use the COSMIC cell line database, you need to sign up for an account at their website and get permission to download their files (which is given free of charge to academia and non-profit organisation, but requires a commersial license for for-profit organisations). The required file is the one named CosmicCLP_MutantExport.tsv.gz, listed under complete mutational data here. As redistributing this file is not allowed, this package includes an extremely minimal subset of the original file, only useful for examples in this vignette and unit testing. Do not use this file for your own analyses, as your results will neither be complete nor accurate!

The first thing to check is to see if your specific cell line is available in the database, which can be accomplished using the list_cosmic function:

file <- system.file("extdata", "subset_CosmicCLP_MutantExport.tsv.gz",
                    package = "seqCAT")
cell_lines <- list_cosmic(file)
head(cell_lines)
## [1] "639V"  "A427"  "A549"  "AGS"   "AMO1"  "AN3CA"

This gives us a simple vector containing all the available cell lines in the COSMIC database (this version of the file is for the GRCh37 assembly). You can search it for a cell line of your choice:

any(grepl("HCT116", cell_lines))
## [1] TRUE

All COSMIC-related functions perform some simplification of cell line names (as there is variation in the usage of dashes, dots and other symbols), and are case-insensitive. When you have asserted that your cell line of interest is available, you can then read the profile for that cell line using the read_cosmic function:

cosmic <- read_cosmic(file, "HCT116")
head(cosmic)
## GRanges object with 1 range and 12 metadata columns:
##      seqnames               ranges strand |        gene        sample
##         <Rle>            <IRanges>  <Rle> | <character>   <character>
##   43       12 [25398281, 25398281]      * |        KRAS COSMIC.HCT116
##               ID         CDS          AA             description
##      <character> <character> <character>             <character>
##   43     COSM532     c.38G>A      p.G13D Substitution - Missense
##                                    somatic_status verification_status
##                                       <character>         <character>
##   43 Reported in another cancer sample as somatic            Verified
##              REF         ALT          A1          A2
##      <character> <character> <character> <character>
##   43           C           T           C           T
##   -------
##   seqinfo: 24 sequences from an unspecified genome; no seqlengths

You now have a small, COSMIC SNV profile for your cell line, which you can compare to any other profile you may have data for (more on this below). You can also check how many variants are listed in COSMIC for your particular cell:

length(cosmic)
## [1] 1

Here we only see a single variant for the HCT116 cell line, which is only because of the extreme small subset of the COSMIC databse being used here. HCT116 has, in fact, over 2000 listed COSMIC SNVs, making it one of the more abundantly characterised cell lines available (as most cell lines has only a few hundred SNVs listed in COSMIC). A COSMIC profile of a couple of hundred variants is more common, though, and any analysis based only on COSMIC variants is thus inherently limited.

3 Comparing SNV profiles

3.1 Comparing full profiles

Once each relevant sample has its own SNV profile the comparisons can be performed. First, each profile is read using the read_profile function, which outputs GRanges objects for fast and efficient comparisons.

hct116 <- read_profile("hct116_profile.txt", "HCT116")
rko <- read_profile("rko_profile.txt", "RKO")
head(hct116)
## GRanges object with 6 ranges and 17 metadata columns:
##     seqnames         ranges strand |        rsID               gene
##        <Rle>      <IRanges>  <Rle> | <character>        <character>
##   1       12 [80385, 80385]      * | rs370087224 ABC7-42389800N19.1
##   2       12 [80399, 80399]      * |        None ABC7-42389800N19.1
##   3       12 [80422, 80422]      * | rs373297723 ABC7-42389800N19.1
##   4       12 [80729, 80729]      * | rs375960073 ABC7-42389800N19.1
##   5       12 [83011, 83011]      * | rs370570891 ABC7-42389800N19.1
##   6       12 [83012, 83012]      * | rs374646339 ABC7-42389800N19.1
##              ENSGID          ENSTID         REF         ALT      impact
##         <character>     <character> <character> <character> <character>
##   1 ENSG00000226210 ENST00000400706           C           T    MODIFIER
##   2 ENSG00000226210 ENST00000400706           G           A    MODIFIER
##   3 ENSG00000226210 ENST00000400706           G           A    MODIFIER
##   4 ENSG00000226210 ENST00000400706           A           G    MODIFIER
##   5 ENSG00000226210 ENST00000400706           T           C    MODIFIER
##   6 ENSG00000226210 ENST00000400706           C           G    MODIFIER
##             effect     feature                biotype        DP       AD1
##        <character> <character>            <character> <integer> <integer>
##   1 intron_variant  transcript unprocessed_pseudogene        10         8
##   2 intron_variant  transcript unprocessed_pseudogene        10         4
##   3 intron_variant  transcript unprocessed_pseudogene