Comprehensive archiving of genome-scale sequencing experiments is valuable for substantive and methodological progress in multiple domains.
The HumanTranscriptomeCompendium package provides functions for interacting with quantifications and metadata for over 180000 sequenced human transcriptomes.
BiocFileCache is used to manage access
to a modest collection of metadata about compendium
contents. By default, htx_load will
load the cache and establish a connection to
a remote HDF5 representation of the quantifications.
The numerical data is lodged in an instance of
the HDF Scalable Data Service, at
## Loading required namespace: BiocFileCache
## adding rname 'https://biocfound-bigrnatx.s3.us-west-2.amazonaws.com/rangedHtxGeneSE.rds'
## class: RangedSummarizedExperiment
## dim: 58288 181134
## metadata(1): rangeSource
## assays(1): counts_lstpm
## rownames(58288): ENSG00000000003.14 ENSG00000000005.5 ...
##   ENSG00000284747.1 ENSG00000284748.1
## rowData names(0):
## colnames(181134): DRX001125 DRX001126 ... SRX999990 SRX999991
## colData names(4): experiment_accession experiment_platform
##   study_accession study_title
## <58288 x 181134> matrix of class DelayedMatrix and type "double":
##                       DRX001125     DRX001126     DRX001127 ...   SRX999990
## ENSG00000000003.14    40.001250   1322.844547   1528.257578   .   1149.0341
##  ENSG00000000005.5     0.000000      9.999964      6.000006   .      0.0000
## ENSG00000000419.12    64.000031   1456.004418   2038.996875   .   1485.0003
## ENSG00000000457.13    31.814591   1583.504257   1715.041308   .    631.7751
## ENSG00000000460.16    12.430602    439.321234    529.280324   .    945.6903
##                ...            .             .             .   .           .
##  ENSG00000284744.1   1.05614505   24.81388079   32.29261298   .    7.316061
##  ENSG00000284745.1   0.99999879   15.99996994   16.99999743   .    0.000000
##  ENSG00000284746.1   0.00000000    0.00379458    0.00000000   .    0.000000
##  ENSG00000284747.1   7.77564984  270.83296409  239.88056843   .  108.011633
##  ENSG00000284748.1   1.00000768   22.23010514   37.73881938   .   11.278980
##                       SRX999991
## ENSG00000000003.14    1430.3955
##  ENSG00000000005.5       0.0000
## ENSG00000000419.12    1970.0004
## ENSG00000000457.13     802.0563
## ENSG00000000460.16    1259.7648
##                ...            .
##  ENSG00000284744.1     3.268453
##  ENSG00000284745.1     0.000000
##  ENSG00000284746.1     0.000000
##  ENSG00000284747.1    94.606851
##  ENSG00000284748.1     5.240970
We use crude pattern matching on the study titles to identify single-cell RNA-seq experiments.
## [1] 59886
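As an illustration of this kind of title matching, here is a minimal sketch in base R. The titles are invented stand-ins for the study_title column, and the regular expression is an assumption, not necessarily the pattern used to obtain the count above.

```r
# Invented stand-ins for the study_title column of colData
study_title <- c(
  "Single-cell RNA-seq of human pancreatic islets",
  "Bulk RNA-seq of tumor biopsies",
  "scRNA-seq atlas of blood cells",
  "Single cell profiling of cortical neurons",
  "RNA-seq of cultured fibroblasts"
)

# Assumed pattern: "single", any one character, then "cell", ignoring case
hits <- grep("single.cell", study_title, ignore.case = TRUE)

# Number of matching titles
n_hits <- length(hits)
print(study_title[hits])
```

The third title describes a single-cell experiment but is not matched, which shows why the approach is called crude and why the matched titles are worth inspecting.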
Now we will determine which studies are involved, and inspect the titles of the single-cell studies to assess the specificity of this approach.
## [1] 142
## class: RangedSummarizedExperiment
## dim: 58288 662
## metadata(1): rangeSource
## assays(1): counts_lstpm
## rownames(58288): ENSG00000000003.14 ENSG00000000005.5 ...
##   ENSG00000284747.1 ENSG00000284748.1
## rowData names(0):
## colnames(662): ERX1097381 ERX1097382 ... SRX972028 SRX972029
## colData names(4): experiment_accession experiment_platform
##   study_accession study_title
To acquire numerical values,
as.matrix(assay()) is needed, because assay() alone returns a
DelayedMatrix that only references the remote quantifications.
## Warning in data.frame(x = x.new, y = y): row names were found from a short ## variable and have been discarded
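The as.matrix(assay()) conversion can be sketched on a small in-memory object. The SummarizedExperiment below is a toy built for illustration, with invented values and names; with the compendium, the DelayedMatrix would refer to remote data instead, and the conversion is the step that retrieves the numbers.

```r
library(SummarizedExperiment)
library(DelayedArray)

# Toy quantifications: 2 features by 3 samples (invented values)
m <- matrix(as.numeric(1:6), nrow = 2,
            dimnames = list(c("g1", "g2"), c("s1", "s2", "s3")))

# Wrap the matrix as a DelayedMatrix and store it as the assay
se <- SummarizedExperiment(assays = list(counts_lstpm = DelayedArray(m)))

# assay() returns the delayed representation, not an ordinary matrix
delayed <- assay(se)
print(class(delayed))

# as.matrix() realizes the values as an ordinary numeric matrix
x <- as.matrix(assay(se[, c("s1", "s3")]))
print(x)
```

Subsetting before the conversion keeps the realized matrix small, which matters when the full assay has tens of thousands of rows and columns.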
This feature is not available until further notice.
By setting genesOnly to FALSE in htx_load,
we can obtain a transcript-level version of the compendium.
Note that the number of samples in this version exceeds
that of the gene version by two. There are two
unintended columns in the underlying HDF Cloud
array, with names ‘X0’ and ‘X0.1’, that should
The primary purposes of the HumanTranscriptomeCompendium package are
We will address these in turn.
htx_load has three arguments:
genesOnly defaults to TRUE. If it is TRUE, the HDF array that
will be used consists of gene-level quantifications; otherwise
the array in use will consist of transcript-level quantifications
based on the Gencode V27 models.
remotePath is the path to an RDS-formatted RangedSummarizedExperiment
instance that has been prepared to include a DelayedArray
reference to the HSDS representation of the quantifications. The
specific reference used depends on the setting of genesOnly.
The default value currently references an AWS S3 bucket to
retrieve the RDS.
cache is an instance of
BiocFileCache, where the RDS
will be stored and retrieved as needed.
A typical use is
htx = htx_load(), which efficiently sets
htx to give access to gene-level quantifications.
After such a command is issued,
assay(htx[G, S]) is the
DelayedMatrix for features
G on samples S. If G or S
are too long, the HSDS may return an error. Systematic
chunking of large requests is a topic of future development.
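Until such chunking is provided, a long request can be split by hand. The following is a minimal sketch under stated assumptions: fetch_in_chunks is a hypothetical helper, not a function of HumanTranscriptomeCompendium, and an ordinary matrix with invented values stands in for the remote assay.

```r
# Hypothetical helper (not part of the package): retrieve x[rows, cols]
# in column blocks of at most chunk_size columns and bind the pieces.
fetch_in_chunks <- function(x, rows, cols, chunk_size = 100L) {
  # Block number for each requested column: 1, 1, ..., 2, 2, ...
  block <- ceiling(seq_along(cols) / chunk_size)
  pieces <- lapply(split(cols, block), function(j) {
    # For a DelayedMatrix, as.matrix() is where the data would be fetched
    as.matrix(x[rows, j, drop = FALSE])
  })
  # Reassemble the blocks in their original column order
  do.call(cbind, unname(pieces))
}

# Stand-in for a remote assay: an ordinary matrix with invented values
m <- matrix(seq_len(20 * 250), nrow = 20,
            dimnames = list(paste0("g", 1:20), paste0("s", 1:250)))
G <- c("g3", "g7")
S <- paste0("s", 1:250)

res <- fetch_in_chunks(m, G, S, chunk_size = 100L)
print(dim(res))
```

A suitable chunk_size for the HSDS is not stated in this document, so the value of 100 is arbitrary.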
htx_query_by_study_accession has one mandatory argument,
study_accessions. This function uses
htx_load to prepare a SummarizedExperiment
with DelayedArray assay data,
with samples limited to those in the studies listed in the character vector
study_accessions. Optional arguments to this function
are passed to htx_load.
htx_app has no arguments. It starts a shiny app that lists studies by
size, study accession number, and study title. Users can search titles
using regular expressions and can request retrieval of multiple studies.
The studies are returned in a SummarizedExperiment for use in R.
A more advanced query/retrieval app is prototyped at vjcitn.shinyapps.io/cancer9k.
The cancer9k app provides a ‘search engine’-like capability over a richer
collection of sample-level attributes. See the package at vjcitn/htxapp
for the sources related to cancer9k.
A number of the functions described in this subsection make use of the SRAdbV2 package managed at github.com/seandavi/SRAdbV2. If this package is not installed, some of the functions described will fail.
This is a vector of length 3829708. It provides relative paths for all relevant salmon output files developed in the BigRNA project.
This is a data.frame with 294174
rows and 6 columns. It is
a record of all SRA experiments for which metadata was retrieved
via SRAdbV2 as of 28 June 2018.
studTable provides study title
for each experiment.
This function uses SRAdbV2 in real time to acquire study-level metadata component ‘sample.attributes’ for a selected SRA study accession number.
This function reads
tx2gene.gencode.v27.csv from tximportData.
A character vector of 186011 unique experiment accession numbers.
A vector of 181136 strings giving the column names for the transcript-level quantifications.
This utility will add a rowData component to the result of
htx_load(..., genesOnly=TRUE), giving the gene type, gene id,
gene name, and havana gene for each row as available.
The production of HumanTranscriptomeCompendium has considerable complexity. There is a persistent repository of salmon outputs at
where the experiment accession is substituted for *.
an experiment accession number and materializes the salmon quantification
for the user in the form
> str(nn)
List of 4
 $ abundance          : num [1:58288, 1] 22.8668 0.0286 32.8925 2.9392 4.1314 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:58288] "ENSG00000000003.14" "ENSG00000000005.5" "ENSG00000000419.12" "ENSG00000000457.13" ...
  .. ..$ : NULL
 $ counts             : num [1:58288, 1] 2427 2 1744 634 662 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:58288] "ENSG00000000003.14" "ENSG00000000005.5" "ENSG00000000419.12" "ENSG00000000457.13" ...
  .. ..$ : NULL
 $ length             : num [1:58288, 1] 1962 1294 980 3984 2964 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:58288] "ENSG00000000003.14" "ENSG00000000005.5" "ENSG00000000419.12" "ENSG00000000457.13" ...
  .. ..$ : NULL
 $ countsFromAbundance: chr "lengthScaledTPM"
This can be used to check the accuracy of the image of the data in HSDS.
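One way such a check might look is sketched below in base R. Both objects are stand-ins: nn mimics the counts component of the list shown above for three genes, and hsds_col is an invented vector playing the role of the same sample's column retrieved from the HSDS-backed assay. The agreement between them is assumed for illustration and is not a reported result.

```r
genes <- c("ENSG00000000003.14", "ENSG00000000005.5", "ENSG00000000419.12")

# Stand-in for the salmon-derived list: a one-column matrix of counts
nn <- list(counts = matrix(c(2427, 2, 1744), ncol = 1,
                           dimnames = list(genes, NULL)))

# Invented stand-in for one realized assay column, deliberately in a
# different gene order
hsds_col <- c("ENSG00000000419.12" = 1744,
              "ENSG00000000003.14" = 2427,
              "ENSG00000000005.5"  = 2)

# Align by gene identifier, then compare within numerical tolerance
aligned <- hsds_col[rownames(nn$counts)]
ok <- isTRUE(all.equal(unname(aligned), as.numeric(nn$counts)))
print(ok)
```

Aligning by identifier before comparing avoids a spurious mismatch when the two sources order their rows differently.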