Creating Tallies with h5py/Python

This vignette is a step by step guide on how to create tallies for bam files using Python scripts we provide, it can be useful when tallying on a high performance compute cluster. In most cases the methods explained in the creating.tallies.within.R vignette will be sufficient and we suggest that you start there.

Software setup

You will need the following softwares intralled on the system that you want to use for creating tallies.

Installation example on Ubuntu 12.04 LTS

We start with a fresh install of Ubuntu 12.04 LTS.

Installing needed system packages

First we install required packages in the OS using the package manager (apt-get).

sudo apt-get install build-essential samtools python-dev swig cython python-numpy python-matplotlib hdf5-tools libhdf5-serial-1.8.4 libhdf5-serial-dev screen subversion git

Getting the python modules

Let's create a local Software directory to store our modules.

cd ~
mkdir Software
cd Software

Installing pysam

wget http://pysam.googlecode.com/files/pysam-0.7.5.tar.gz
tar xvfz pysam-0.7.5.tar.gz
cd pysam-0.7.5
sudo python setup.py install

Installing HTSeq

cd ~/Software
wget https://pypi.python.org/packages/source/H/HTSeq/HTSeq-0.5.4p3.tar.gz
tar xvfz HTSeq-0.5.4p3.tar.gz
cd HTSeq-0.5.4p3
./build_it
sudo python setup.py install

Installing h5py

cd ~/Software
wget http://h5py.googlecode.com/files/h5py-2.2.0.tar.gz
tar xvfz h5py-2.2.0.tar.gz
cd h5py-2.2.0/
sudo python setup.py install

Testing the installation

Let's get the tally.bam.py python script and see if we can run it.

cd ~/Software
wget http://www.ebi.ac.uk/~pyl/h5vc/tally.bam.py

Running the tally script

Calling the script with the -h option should print the following help message.

python tally.bam.py -h
usage: tally.bam.py [-h] [--bam BAM [BAM ...]] [--reg REG [REG ...]]
                    [--out OUT] [--group GROUP] [--log LOG]
                    [--padding_front PADDING_FRONT]
                    [--padding_back PADDING_BACK] [--readlen READLEN]
                    [--min_mapq MIN_MAPQ] [--loglvl LOGLVL] [--debug]
                    [--count_skipped] [--core] [--replace] [--gzip]

A HTSeq script for tallying mismatches, coverage and deletions a set of
samples and regions

optional arguments:
  -h, --help            show this help message and exit
  --bam BAM [BAM ...]   bam file to tally in, can be specified more than once;
                        use "SampleID:PatientID:SampleType:BamFileName" to
                        encode Sample ID, Patient ID and Sample Type!
                        Examples: "Pt1Relapse:Patient1:Case:pt1r.bam",
                        "Pt1Control:Patient1:Control:pt1c.bam"
  --reg REG [REG ...]   region to tally in (e.g. X:10000-20000), can be
                        specified more than once; use only the chromosome name
                        for whole chromosome processing
  --out OUT             output destination tag - will be prefixed with group
                        and sample, e.g. VariantTally.case.output.hfs
  --group GROUP         group_id for hfs5 output
  --log LOG             msg-log destination
  --padding_front PADDING_FRONT
                        Number of cycles from the start of the read to
                        consider as untrustworthy - defaults to 10
  --padding_back PADDING_BACK
                        Number of cycles from the end of the read to consider
                        as untrustworthy - defaults to 10
  --readlen READLEN     Length of the reads in the bam file (all reads in the
                        input bam must have the same length - defaults to 101)
  --min_mapq MIN_MAPQ   minimum mapping quality for a read to be considered -
                        defaults to 5
  --loglvl LOGLVL       ping every args.loglvl entries - defaults to 100000
  --debug               Should debug information be printed, this can lead to
                        really big log-files
  --count_skipped       Should cigar operations denoting skipped bases be
                        counted as deletions (CIGAR code "N")
  --core                Should the hfs5 "core" driver be used, this loads the
                        file into memory completely!
  --replace             Replace output file if it exists!
  --gzip                Use gzip compression in the hdf5 file

Version: 0.0.1 Written by Paul Theodor Pyl (pyl@embl.de), European Molecular
Biology Laboratory (EMBL). (c) 2012. Released under the terms of the GNU
General Public License v3.

We should also get the merge.tallies.by.chromosome.py python script and see if we can run it. This script can be used to merge tally files containing the tallies of different samples on the same chromosome(s) and will be important in settings with large cohorts or where samples are added over time.

cd ~/Software
wget http://www.ebi.ac.uk/~pyl/h5vc/merge.tallies.by.chromosome.py

Running the merging script

Calling the script with the -h option should print the following help message.

python merge.tallies.by.chromosome.py -h
usage: merge.tallies.by.chromosome.py [-h] [--tally TALLY [TALLY ...]]
                                      [--out OUT] --chrom CHROM
                                      [--group GROUP] [--log LOG] [--gzip]

A HTSeq script for merging tallies

optional arguments:
  -h, --help            show this help message and exit
  --tally TALLY [TALLY ...]
                        tally to be added; all files must contain the same
                        chromosome;
  --out OUT             output destination
  --chrom CHROM         chromosome to process
  --group GROUP         group_id for hfs5 output
  --log LOG             msg-log destination
  --gzip                Use gzip compression in the hdf5 file

Version: 0.0.1 Written by Paul Theodor Pyl (pyl@embl.de), European Molecular
Biology Laboratory (EMBL). (c) 2012. Released under the terms of the GNU
General Public License v3.

An example run with publicly available data

Warning: The following examples take time, compute power and disk space, you might need to spend a day or two on this :)

We will now perform a test run of the tally script using publicly available data from the European Nucleotide Archive. Specifically we will get some WGS runs from two yeast strains (s288c and YB210) that have been sequenced with Illumina HiSeq 2000 machines.

wget FTP download of reads

s288c runs

wget ftp.sra.ebi.ac.uk/vol1/fastq/SRR488/SRR488141/SRR488141_1.fastq.gz
wget ftp.sra.ebi.ac.uk/vol1/fastq/SRR488/SRR488141/SRR488141_2.fastq.gz  

YB210 runs

wget ftp.sra.ebi.ac.uk/vol1/fastq/SRR488/SRR488139/SRR488139_1.fastq.gz
wget ftp.sra.ebi.ac.uk/vol1/fastq/SRR488/SRR488139/SRR488139_2.fastq.gz

aspera connect download of reads

Firstly, install AsperaConnect into the Firefox browser shipped with Ubuntu. The executable will be in your home directory located at ~/.aspera/connect/bin

s288c runs

~/.aspera/connect/bin/ascp -QT -l 300m -i ~/.aspera/connect/etc/asperaweb_id_dsa.putty era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/SRR488/SRR488141/SRR488141_1.fastq.gz .
~/.aspera/connect/bin/ascp -QT -l 300m -i ~/.aspera/connect/etc/asperaweb_id_dsa.putty era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/SRR488/SRR488141/SRR488141_2.fastq.gz .

YB210 runs

~/.aspera/connect/bin/ascp -QT -l 300m -i ~/.aspera/connect/etc/asperaweb_id_dsa.putty era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/SRR488/SRR488139/SRR488139_1.fastq.gz .
~/.aspera/connect/bin/ascp -QT -l 300m -i ~/.aspera/connect/etc/asperaweb_id_dsa.putty era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/SRR488/SRR488139/SRR488139_2.fastq.gz .

setting up the aligner and reference - bwa + ENSEMBL yeast genome

We will use bwa as an example of how to align those reads, of course you may choose a different aligner. Our choice is purely for simplicity of the tutorial and there might be various reasons you could have for choosing more complex setups than this.

Downloading the newest bwa

cd ~/Software
git clone https://github.com/lh3/bwa.git
cd bwa; make
echo alias bwa='~/Software/bwa/bwa' > ~/.bashrc

Getting the reference Genome

We download the reference from Ensembl.

mkdir Reference
cd Reference
wget ftp://ftp.ensembl.org/pub/release-73/fasta/saccharomyces_cerevisiae/dna/Saccharomyces_cerevisiae.EF4.73.dna.toplevel.fa.gz
gunzip Saccharomyces_cerevisiae.EF4.73.dna.toplevel.fa.gz

Indexing the reference genome

Before we can use the reference FASTA file to align reads we need to index it for the given aligner.

bwa index Saccharomyces_cerevisiae.EF4.73.dna.toplevel.fa

This should not take long.

Aligning the reads

We align the reads using the bwa mem command with the following extra options:

Assuming the .fastq.gz files were downloaded into separate sub-folders named for the respective strain of yeast (i.e. s288c and YB210) we can use the following commands to align the reads.

s288c

bwa mem -t 4 -M -R '@RG\tID:s288c\tSM:ilumina' Reference/Saccharomyces_cerevisiae.EF4.73.dna.toplevel.fa s288c/SRR488141_1.fastq.gz s288c/SRR488141_2.fastq.gz | samtools view -Shb - > s288c.bam
samtools sort s288c.bam s288c.sorted
samtools index s288c.sorted.bam

YB210

bwa mem -t 4 -M -R '@RG\tID:YB210\tSM:ilumina' Reference/Saccharomyces_cerevisiae.EF4.73.dna.toplevel.fa YB210/SRR488139_1.fastq.gz YB210/SRR488139_2.fastq.gz | samtools view -Shb - > YB210.bam
samtools sort YB210.bam YB210.sorted
samtools index YB210.sorted.bam

Downloading the GATK for IndelRealignment

This step has to be performed manually since an account with the GATK forums is required before downloading the GATK. Eventhough it is highly recommended to perform indel realignment (as the GATK can do) for the purpose of this vignette you may skip this step and continue with the actually tallying. Once you have downloaded and extracted it, the folder should contain the following files (assuming you did export GATK=/path/to/GATK/):

cd $GATK
ls -l
-rw-r--r-- 1 pyl pyl 14436682 Aug 28 22:32 GenomeAnalysisTK.jar
drwxr-xr-x 2 pyl pyl     4096 Sep 12 12:44 resources

Downloading Picard Tools

cd ~/Software
wget http://downloads.sourceforge.net/project/picard/picard-tools/1.98/picard-tools-1.98.zip
unzip picard-tools-1.98.zip

Prepare the Reference for use with GATK

cd ~/Reference
java -jar ~/Software/picard-tools-1.98/CreateSequenceDictionary.jar R=Saccharomyces_cerevisiae.EF4.73.dna.toplevel.fa O=Saccharomyces_cerevisiae.EF4.73.dna.toplevel.dict
samtools faidx Saccharomyces_cerevisiae.EF4.73.dna.toplevel.fa

You can now run the RealignerTargetCreator walker to create a set of intervals in which realignment might be necessary. You need Java 7 for this; i.e. do sudo apt-get install openjdk-7-jre first.

gatk -T RealignerTargetCreator -nt 4 -R Reference/Saccharomyces_cerevisiae.EF4.73.dna.toplevel.fa -I s288c.sorted.bam -I YB210.sorted.bam -o target.intervals

The parameters are:

The resulting file is simply a list of intervals in which indels might occur and that need to be realigned.

head -n 3 target.intervals 
I:1723-1724
I:2875-2876
I:3889-3890

Using the indel realignment pipeline of the GATK

We can now run the IndelRealigner walker with the following command:

gatk -T IndelRealigner -R Reference/Saccharomyces_cerevisiae.EF4.73.dna.toplevel.fa -I s288c.sorted.bam -I YB210.sorted.bam --targetIntervals target.intervals -nWayOut .realigned.bam

This might take a while …

The tallying is based on the MD tag in the .bam file which encodes mismatches, this can be invalidated by hte realignment and therefore has to be recomputed.

samtools calmd s288c.sorted.realigned.bam Reference/Saccharomyces_cerevisiae.EF4.73.dna.toplevel.fa -b | samtools view -bh - > s288c.final.bam
samtools calmd YB210.sorted.realigned.bam Reference/Saccharomyces_cerevisiae.EF4.73.dna.toplevel.fa -b | samtools view -bh - > YB210.final.bam
samtools index s288c.final.bam
samtools index YB210.final.bam

Creating the tally

Now we have our realigned files with correct MD tags and indexes and can finally use them to create the tally. We specify the sample s288c as control and the sample YB210 as case, create the gzipped output file yeast.hfs5 and specify the readlength, which is 100 we also specify all the chromosomes as regions to tally in. In a human sample we would most likely utilise a compute cluster and parallelize on the chromosome level, but for the yeast genome this is not neccessary.

python tally.bam.py --bam s288c:yeast:Control:s288c.final.bam --bam YB210:yeast:Case:YB210.final.bam --out yeast.hfs5 --group yeast --reg I --reg II --reg III --reg IV --reg V --reg VI --reg VII --reg VIII --reg IX --reg X --reg XI --reg XII --reg XIII --reg XIV --reg XV --reg XVI --reg Mito --gzip --readlen 100 --replace

Creating the tally in parallel

In some settings we might not want to create the tally file for the whole cohort in one thread on one machine. This might be due to the size of the cohort (e.g. many human whole genome samples) or because we expect more samples to be added over time. In this case we can parallelize the tallying by sample and chromosome, so that in a first step we create a tally file for each chromosome in each sample and then merge the files for the same chromosomes in all samples into the final set of files.

We could also merge the files of different chromosomes together but most operations we want to perform on tallies happen on a single chromosome (or region therein) without depending on data from other chromosomes so this is not neccessary.

The following example shows a parallelized version of the workflow for tallying our two yeast strains (bash example).

for chrom in I II III IV V VI VII VIII IX X XI XII XIII XIV XV XVI Mito
do
  bsub -J $chrom.s288c "python tally.bam.py --reg $chrom --bam s288c:yeast:Control:s288c.final.bam --out s288c.$chrom.hfs5 --group yeast --gzip --readlen 100 --replace"
  bsub -J $chrom.YB210 "python tally.bam.py --reg $chrom --bam YB210:yeast:Case:YB210.final.bam --out YB210.$chrom.hfs5 --group yeast --gzip --readlen 100 --replace"
done

We use the bsub command to interact with a LSF cluster and depending on your available compute infrastructure you might have to adapt this part of the code to use a different command (or simply send the processes to the background by appending & if you have a multicore machine with a performant storage solution, i.e. RAID / SSDs)

Running this code produces files like s288c.XIV.hfs5 or YB210.IV.hfs5. We will now create one tally file per chromosome by merging the files belonging to the same chromosome.

for chrom in I II III IV V VI VII VIII IX X XI XII XIII XIV XV XVI Mito
do
  bsub -J $chrom.yeast.merge "python merge.tallies.by.chromosome.py --out yeast.$chrom.hfs5 --chrom $chrom --group yeast --gzip --tally YB210.$chrom.hfs5 --tally s288c.$chrom.hfs5"
done

After those jobs are run we have the files yeast.I.hfs5 to yeast.Mito.hfs5 and can now go on doing analyses on them.

Did it work

We can now use the tools provided by h5vc and rhdf5 to have a look into our tally file.

library(h5vc)
library(rhdf5)
h5ls("yeast.hfs5")

Here we see a list of all the chromsoomes and datasets, since we put all chromosomes into the same file this is pretty long.

         group      name       otype  dclass                  dim
0            /     yeast   H5I_GROUP
1       /yeast         I   H5I_GROUP
2     /yeast/I    Counts H5I_DATASET INTEGER  12 x 2 x 2 x 230218
3     /yeast/I Coverages H5I_DATASET INTEGER       2 x 2 x 230218
4     /yeast/I Deletions H5I_DATASET INTEGER       2 x 2 x 230218
5     /yeast/I Reference H5I_DATASET INTEGER               230218
6       /yeast        II   H5I_GROUP
7    /yeast/II    Counts H5I_DATASET INTEGER  12 x 2 x 2 x 813184
8    /yeast/II Coverages H5I_DATASET INTEGER       2 x 2 x 813184
9    /yeast/II Deletions H5I_DATASET INTEGER       2 x 2 x 813184
10   /yeast/II Reference H5I_DATASET INTEGER               813184

[...]

81      /yeast       XVI   H5I_GROUP
82  /yeast/XVI    Counts H5I_DATASET INTEGER  12 x 2 x 2 x 948066
83  /yeast/XVI Coverages H5I_DATASET INTEGER       2 x 2 x 948066
84  /yeast/XVI Deletions H5I_DATASET INTEGER       2 x 2 x 948066
85  /yeast/XVI Reference H5I_DATASET INTEGER               948066

Note how the last dimension (the genomic position) differs from chromosome to chromosome, while the first dimensions (bases, samples and strands) stay the same.

We check if the sample data was correctly added.

getSampleData( "yeast.hfs5", "/yeast/I" )

This data frame is quite small since we only have 2 samples in this cohort. We can see that s288c was calles Control and written into the first slot in the sample dimension (i.e. the coverage of s288c on chromosome I is found in the respective 'Coverages' dataset like so: data$Coverages[1,,], assuming we have used h5dapply to extract the data. See the other vignettes for details.)

  Sample Column Patient    Type     SampleFiles
1  YB210      2   yeast    Case YB210.final.bam
2  s288c      1   yeast Control s288c.final.bam

Now we can go on and find differences between the samples with the tools described in the other vignettes or build a genome browser and have a look at the contents of our tally file (see h5vc.simple.genome.browser for this).