This vignette is a step by step guide on how to create tallies for bam files using
Python scripts we provide, it can be useful when tallying on a high performance compute cluster. In most cases the methods explained in the
creating.tallies.within.R vignette will be sufficient and we suggest that you start there.
You will need the following softwares intralled on the system that you want to use for creating tallies.
We start with a fresh install of Ubuntu 12.04 LTS.
First we install required packages in the OS using the package manager (
sudo apt-get install build-essential samtools python-dev swig cython python-numpy python-matplotlib hdf5-tools libhdf5-serial-1.8.4 libhdf5-serial-dev screen subversion git
Let's create a local
Software directory to store our modules.
cd ~ mkdir Software cd Software
wget http://pysam.googlecode.com/files/pysam-0.7.5.tar.gz tar xvfz pysam-0.7.5.tar.gz cd pysam-0.7.5 sudo python setup.py install
cd ~/Software wget https://pypi.python.org/packages/source/H/HTSeq/HTSeq-0.5.4p3.tar.gz tar xvfz HTSeq-0.5.4p3.tar.gz cd HTSeq-0.5.4p3 ./build_it sudo python setup.py install
cd ~/Software wget http://h5py.googlecode.com/files/h5py-2.2.0.tar.gz tar xvfz h5py-2.2.0.tar.gz cd h5py-2.2.0/ sudo python setup.py install
Let's get the
tally.bam.py python script and see if we can run it.
cd ~/Software wget http://www.ebi.ac.uk/~pyl/h5vc/tally.bam.py
Calling the script with the
-h option should print the following help message.
python tally.bam.py -h
usage: tally.bam.py [-h] [--bam BAM [BAM ...]] [--reg REG [REG ...]] [--out OUT] [--group GROUP] [--log LOG] [--padding_front PADDING_FRONT] [--padding_back PADDING_BACK] [--readlen READLEN] [--min_mapq MIN_MAPQ] [--loglvl LOGLVL] [--debug] [--count_skipped] [--core] [--replace] [--gzip] A HTSeq script for tallying mismatches, coverage and deletions a set of samples and regions optional arguments: -h, --help show this help message and exit --bam BAM [BAM ...] bam file to tally in, can be specified more than once; use "SampleID:PatientID:SampleType:BamFileName" to encode Sample ID, Patient ID and Sample Type! Examples: "Pt1Relapse:Patient1:Case:pt1r.bam", "Pt1Control:Patient1:Control:pt1c.bam" --reg REG [REG ...] region to tally in (e.g. X:10000-20000), can be specified more than once; use only the chromosome name for whole chromosome processing --out OUT output destination tag - will be prefixed with group and sample, e.g. VariantTally.case.output.hfs --group GROUP group_id for hfs5 output --log LOG msg-log destination --padding_front PADDING_FRONT Number of cycles from the start of the read to consider as untrustworthy - defaults to 10 --padding_back PADDING_BACK Number of cycles from the end of the read to consider as untrustworthy - defaults to 10 --readlen READLEN Length of the reads in the bam file (all reads in the input bam must have the same length - defaults to 101) --min_mapq MIN_MAPQ minimum mapping quality for a read to be considered - defaults to 5 --loglvl LOGLVL ping every args.loglvl entries - defaults to 100000 --debug Should debug information be printed, this can lead to really big log-files --count_skipped Should cigar operations denoting skipped bases be counted as deletions (CIGAR code "N") --core Should the hfs5 "core" driver be used, this loads the file into memory completely! --replace Replace output file if it exists! --gzip Use gzip compression in the hdf5 file Version: 0.0.1 Written by Paul Theodor Pyl (firstname.lastname@example.org), European Molecular Biology Laboratory (EMBL). (c) 2012. Released under the terms of the GNU General Public License v3.
We should also get the
merge.tallies.by.chromosome.py python script and see if we can run it. This script can be used to merge tally files containing the tallies of different samples on the same chromosome(s) and will be important in settings with large cohorts or where samples are added over time.
cd ~/Software wget http://www.ebi.ac.uk/~pyl/h5vc/merge.tallies.by.chromosome.py
Calling the script with the
-h option should print the following help message.
python merge.tallies.by.chromosome.py -h
usage: merge.tallies.by.chromosome.py [-h] [--tally TALLY [TALLY ...]] [--out OUT] --chrom CHROM [--group GROUP] [--log LOG] [--gzip] A HTSeq script for merging tallies optional arguments: -h, --help show this help message and exit --tally TALLY [TALLY ...] tally to be added; all files must contain the same chromosome; --out OUT output destination --chrom CHROM chromosome to process --group GROUP group_id for hfs5 output --log LOG msg-log destination --gzip Use gzip compression in the hdf5 file Version: 0.0.1 Written by Paul Theodor Pyl (email@example.com), European Molecular Biology Laboratory (EMBL). (c) 2012. Released under the terms of the GNU General Public License v3.
Warning: The following examples take time, compute power and disk space, you might need to spend a day or two on this :)
We will now perform a test run of the tally script using publicly available data from the European Nucleotide Archive. Specifically we will get some WGS runs from two yeast strains (s288c and YB210) that have been sequenced with Illumina HiSeq 2000 machines.
SRR488139We suggest using aspera connect to get these files, but simple ftp is also fine, it might just take a long time.
wget ftp.sra.ebi.ac.uk/vol1/fastq/SRR488/SRR488141/SRR488141_1.fastq.gz wget ftp.sra.ebi.ac.uk/vol1/fastq/SRR488/SRR488141/SRR488141_2.fastq.gz
wget ftp.sra.ebi.ac.uk/vol1/fastq/SRR488/SRR488139/SRR488139_1.fastq.gz wget ftp.sra.ebi.ac.uk/vol1/fastq/SRR488/SRR488139/SRR488139_2.fastq.gz
Firstly, install AsperaConnect into the Firefox browser shipped with Ubuntu. The executable will be in your
home directory located at
~/.aspera/connect/bin/ascp -QT -l 300m -i ~/.aspera/connect/etc/asperaweb_id_dsa.putty firstname.lastname@example.org:/vol1/fastq/SRR488/SRR488141/SRR488141_1.fastq.gz . ~/.aspera/connect/bin/ascp -QT -l 300m -i ~/.aspera/connect/etc/asperaweb_id_dsa.putty email@example.com:/vol1/fastq/SRR488/SRR488141/SRR488141_2.fastq.gz .
~/.aspera/connect/bin/ascp -QT -l 300m -i ~/.aspera/connect/etc/asperaweb_id_dsa.putty firstname.lastname@example.org:/vol1/fastq/SRR488/SRR488139/SRR488139_1.fastq.gz . ~/.aspera/connect/bin/ascp -QT -l 300m -i ~/.aspera/connect/etc/asperaweb_id_dsa.putty email@example.com:/vol1/fastq/SRR488/SRR488139/SRR488139_2.fastq.gz .
bwa+ ENSEMBL yeast genome
We will use
bwa as an example of how to align those reads, of course you may choose a different aligner. Our choice is purely for simplicity of the tutorial and there might be various reasons you could have for choosing more complex setups than this.
cd ~/Software git clone https://github.com/lh3/bwa.git cd bwa; make echo alias bwa='~/Software/bwa/bwa' > ~/.bashrc
We download the reference from Ensembl.
mkdir Reference cd Reference wget ftp://ftp.ensembl.org/pub/release-73/fasta/saccharomyces_cerevisiae/dna/Saccharomyces_cerevisiae.EF4.73.dna.toplevel.fa.gz gunzip Saccharomyces_cerevisiae.EF4.73.dna.toplevel.fa.gz
Before we can use the reference FASTA file to align reads we need to index it for the given aligner.
bwa index Saccharomyces_cerevisiae.EF4.73.dna.toplevel.fa
This should not take long.
We align the reads using the
bwa mem command with the following extra options:
-t 4using multithreading with 4 cores (depending on your machine this can be adjusted up or down)
-Mactivate notation of split alignments as secondary alignments to avoid errors with downstream tools (GATK doesn't like multiple primary alignments)
-R '@RG\tID:s288c_one\tSM:ilumina'This command adds read groups to the .bam files
Assuming the .fastq.gz files were downloaded into separate sub-folders named for the respective strain of yeast (i.e.
YB210) we can use the following commands to align the reads.
bwa mem -t 4 -M -R '@RG\tID:s288c\tSM:ilumina' Reference/Saccharomyces_cerevisiae.EF4.73.dna.toplevel.fa s288c/SRR488141_1.fastq.gz s288c/SRR488141_2.fastq.gz | samtools view -Shb - > s288c.bam samtools sort s288c.bam s288c.sorted samtools index s288c.sorted.bam
bwa mem -t 4 -M -R '@RG\tID:YB210\tSM:ilumina' Reference/Saccharomyces_cerevisiae.EF4.73.dna.toplevel.fa YB210/SRR488139_1.fastq.gz YB210/SRR488139_2.fastq.gz | samtools view -Shb - > YB210.bam samtools sort YB210.bam YB210.sorted samtools index YB210.sorted.bam
This step has to be performed manually since an account with the GATK forums is required before downloading the GATK.
Eventhough it is highly recommended to perform indel realignment (as the GATK can do) for the purpose of this vignette you may skip this step and continue with the actually tallying.
Once you have downloaded and extracted it, the folder should contain the following files (assuming you did
cd $GATK ls -l
-rw-r--r-- 1 pyl pyl 14436682 Aug 28 22:32 GenomeAnalysisTK.jar drwxr-xr-x 2 pyl pyl 4096 Sep 12 12:44 resources
cd ~/Software wget http://downloads.sourceforge.net/project/picard/picard-tools/1.98/picard-tools-1.98.zip unzip picard-tools-1.98.zip
cd ~/Reference java -jar ~/Software/picard-tools-1.98/CreateSequenceDictionary.jar R=Saccharomyces_cerevisiae.EF4.73.dna.toplevel.fa O=Saccharomyces_cerevisiae.EF4.73.dna.toplevel.dict samtools faidx Saccharomyces_cerevisiae.EF4.73.dna.toplevel.fa
You can now run the
RealignerTargetCreator walker to create a set of intervals in which realignment might be necessary.
You need Java 7 for this; i.e. do
sudo apt-get install openjdk-7-jre first.
gatk -T RealignerTargetCreator -nt 4 -R Reference/Saccharomyces_cerevisiae.EF4.73.dna.toplevel.fa -I s288c.sorted.bam -I YB210.sorted.bam -o target.intervals
The parameters are:
-nt 44 threads in parallel
-o target.intervalsthis file will hold the target intervals to realign in
The resulting file is simply a list of intervals in which indels might occur and that need to be realigned.
head -n 3 target.intervals
I:1723-1724 I:2875-2876 I:3889-3890
We can now run the
IndelRealigner walker with the following command:
gatk -T IndelRealigner -R Reference/Saccharomyces_cerevisiae.EF4.73.dna.toplevel.fa -I s288c.sorted.bam -I YB210.sorted.bam --targetIntervals target.intervals -nWayOut .realigned.bam
This might take a while …
The tallying is based on the
MD tag in the
.bam file which encodes mismatches, this can be invalidated by hte realignment and therefore has to be recomputed.
samtools calmd s288c.sorted.realigned.bam Reference/Saccharomyces_cerevisiae.EF4.73.dna.toplevel.fa -b | samtools view -bh - > s288c.final.bam samtools calmd YB210.sorted.realigned.bam Reference/Saccharomyces_cerevisiae.EF4.73.dna.toplevel.fa -b | samtools view -bh - > YB210.final.bam samtools index s288c.final.bam samtools index YB210.final.bam
Now we have our realigned files with correct
MD tags and indexes and can finally use them to create the tally.
We specify the sample
s288c as control and the sample
YB210 as case, create the gzipped output file
yeast.hfs5 and specify the readlength, which is
100 we also specify all the chromosomes as regions to tally in. In a human sample we would most likely utilise a compute cluster and parallelize on the chromosome level, but for the yeast genome this is not neccessary.
python tally.bam.py --bam s288c:yeast:Control:s288c.final.bam --bam YB210:yeast:Case:YB210.final.bam --out yeast.hfs5 --group yeast --reg I --reg II --reg III --reg IV --reg V --reg VI --reg VII --reg VIII --reg IX --reg X --reg XI --reg XII --reg XIII --reg XIV --reg XV --reg XVI --reg Mito --gzip --readlen 100 --replace
In some settings we might not want to create the tally file for the whole cohort in one thread on one machine. This might be due to the size of the cohort (e.g. many human whole genome samples) or because we expect more samples to be added over time. In this case we can parallelize the tallying by sample and chromosome, so that in a first step we create a tally file for each chromosome in each sample and then merge the files for the same chromosomes in all samples into the final set of files.
We could also merge the files of different chromosomes together but most operations we want to perform on tallies happen on a single chromosome (or region therein) without depending on data from other chromosomes so this is not neccessary.
The following example shows a parallelized version of the workflow for tallying our two yeast strains (
for chrom in I II III IV V VI VII VIII IX X XI XII XIII XIV XV XVI Mito do bsub -J $chrom.s288c "python tally.bam.py --reg $chrom --bam s288c:yeast:Control:s288c.final.bam --out s288c.$chrom.hfs5 --group yeast --gzip --readlen 100 --replace" bsub -J $chrom.YB210 "python tally.bam.py --reg $chrom --bam YB210:yeast:Case:YB210.final.bam --out YB210.$chrom.hfs5 --group yeast --gzip --readlen 100 --replace" done
We use the
bsub command to interact with a LSF cluster and depending on your available compute infrastructure you might have to adapt this part of the code to use a different command (or simply send the processes to the background by appending
& if you have a multicore machine with a performant storage solution, i.e. RAID / SSDs)
Running this code produces files like
YB210.IV.hfs5. We will now create one tally file per chromosome by merging the files belonging to the same chromosome.
for chrom in I II III IV V VI VII VIII IX X XI XII XIII XIV XV XVI Mito do bsub -J $chrom.yeast.merge "python merge.tallies.by.chromosome.py --out yeast.$chrom.hfs5 --chrom $chrom --group yeast --gzip --tally YB210.$chrom.hfs5 --tally s288c.$chrom.hfs5" done
After those jobs are run we have the files
yeast.Mito.hfs5 and can now go on doing analyses on them.
We can now use the tools provided by
rhdf5 to have a look into our tally file.
library(h5vc) library(rhdf5) h5ls("yeast.hfs5")
Here we see a list of all the chromsoomes and datasets, since we put all chromosomes into the same file this is pretty long.
group name otype dclass dim 0 / yeast H5I_GROUP 1 /yeast I H5I_GROUP 2 /yeast/I Counts H5I_DATASET INTEGER 12 x 2 x 2 x 230218 3 /yeast/I Coverages H5I_DATASET INTEGER 2 x 2 x 230218 4 /yeast/I Deletions H5I_DATASET INTEGER 2 x 2 x 230218 5 /yeast/I Reference H5I_DATASET INTEGER 230218 6 /yeast II H5I_GROUP 7 /yeast/II Counts H5I_DATASET INTEGER 12 x 2 x 2 x 813184 8 /yeast/II Coverages H5I_DATASET INTEGER 2 x 2 x 813184 9 /yeast/II Deletions H5I_DATASET INTEGER 2 x 2 x 813184 10 /yeast/II Reference H5I_DATASET INTEGER 813184 [...] 81 /yeast XVI H5I_GROUP 82 /yeast/XVI Counts H5I_DATASET INTEGER 12 x 2 x 2 x 948066 83 /yeast/XVI Coverages H5I_DATASET INTEGER 2 x 2 x 948066 84 /yeast/XVI Deletions H5I_DATASET INTEGER 2 x 2 x 948066 85 /yeast/XVI Reference H5I_DATASET INTEGER 948066
Note how the last dimension (the genomic position) differs from chromosome to chromosome, while the first dimensions (bases, samples and strands) stay the same.
We check if the sample data was correctly added.
getSampleData( "yeast.hfs5", "/yeast/I" )
This data frame is quite small since we only have 2 samples in this cohort. We can see that s288c was calles Control and written into the first slot in the sample dimension (i.e. the coverage of s288c on chromosome I is found in the respective 'Coverages' dataset like so: data$Coverages[1,,], assuming we have used h5dapply to extract the data. See the other vignettes for details.)
Sample Column Patient Type SampleFiles 1 YB210 2 yeast Case YB210.final.bam 2 s288c 1 yeast Control s288c.final.bam
Now we can go on and find differences between the samples with the tools described in the other vignettes or build a genome browser and have a look at the contents of our tally file (see h5vc.simple.genome.browser for this).