Suppl. Ch. 2 - Import and Tidy Data

Gabriel Odom


1. Overview

This vignette is the second chapter in the “Pathway Significance Testing with pathwayPCA” workflow, providing a detailed perspective on the Import Data section of the Quickstart Guide. This vignette will discuss using the read_gmt function to import Gene Matrix Transposed (.gmt) pathway collection files as a list object with class pathwayCollection. Also, we will discuss importing assay and response data, and how to make your assay data tidy. For our pathway analysis to be meaningful, we need gene expression data (from a microarray or something similar), corresponding phenotype information (such as weight, type of cancer, or survival time and censoring indicator), and a pathway collection.

Before we move on, we will outline our steps. After reading this vignette, you should be able to

  1. Import a .gmt file and save the pathways stored therein as a pathwayCollection object using the read_gmt function.
  2. Import an assay .csv file with the read_csv function from the readr package, and transpose this data frame into “tidy” form with the TransposeAssay function.
  3. Import phenotype information stored in a .csv file, and join (merge) it to the assay data frame with the inner_join function from the dplyr package.

First, load the pathwayPCA package and the tidyverse package suite.

library(tidyverse)
library(pathwayPCA)
# Set tibble data frame print options
options(tibble.max_extra_cols = 10)


2. GMT Files

The .gmt format is a commonly used file format for storing pathway collections. Lists of pathways in the Molecular Signatures Database (MSigDB) can be downloaded from the MSigDB Collections page.

2.1 GMT Format Description

GMT-formatted files follow a very specific set of rules:

  1. Each row of the file represents a pathway, and only one pathway is allowed per line.
  2. The first entry in each row is the pathway name; e.g. "KEGG_STEROID_BIOSYNTHESIS".
  3. The second entry in each row is an optional brief description of the pathway; e.g. "".
  4. The third to the last entry on each row are the gene names in the pathway; e.g. "SOAT1" "LSS" "SQLE" "EBP" "CYP51A1" "DHCR7" "CYP27B1" "DHCR24" "HSD17B7" "MSMO1" "FDFT1" "SC5DL" "LIPA" "CEL" "TM7SF2" "NSDHL" "SOAT2".
  5. Each entry in each line is separated by a tab.
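To make these rules concrete, here is a minimal base-R sketch that parses a single .gmt line by hand. It is only an illustration of the format, not the package's own import code; the line is a shortened version of the example above, and the variable names are ours.

```r
# One line of a .gmt file: name, (possibly empty) description, then genes,
# all separated by tabs. This line is a shortened, illustrative example.
gmt_line <- paste(
  c("KEGG_STEROID_BIOSYNTHESIS", "", "SOAT1", "LSS", "SQLE", "EBP"),
  collapse = "\t"
)

# Split the line on tabs (rule 5)
fields <- strsplit(gmt_line, split = "\t", fixed = TRUE)[[1]]

pathway_name  <- fields[1]        # rule 2: first entry is the pathway name
pathway_desc  <- fields[2]        # rule 3: second entry is the description
pathway_genes <- fields[-(1:2)]   # rule 4: remaining entries are gene names

pathway_name
pathway_genes
```

Because every line follows the same layout, a whole file can be handled by repeating this split once per line.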

2.2 Import GMT files with read_gmt

Based on the clearly-organized .gmt file format, we were able to write a very fast function to read .gmt files into R. The read_gmt function takes in a path specifying where your .gmt file is stored, and outputs a pathways list.
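As a hedged illustration, the sketch below writes a tiny two-pathway .gmt file to a temporary location and reads it back with read_gmt. The file contents, the second pathway name (TOY_PATHWAY), and the object names are made up for this example; in a real analysis you would point read_gmt at a downloaded .gmt file instead.

```r
library(pathwayPCA)

# Write a small, made-up .gmt file: one pathway per line, tab-separated
toy_gmt_path <- tempfile(fileext = ".gmt")
writeLines(
  c(
    paste(c("KEGG_STEROID_BIOSYNTHESIS", "", "SOAT1", "LSS", "SQLE", "EBP"),
          collapse = "\t"),
    paste(c("TOY_PATHWAY", "", "CYP51A1", "DHCR7", "CYP27B1"),
          collapse = "\t")
  ),
  con = toy_gmt_path
)

# Import the file; description = TRUE also keeps the second field of each line
toy_pathwayCollection <- read_gmt(toy_gmt_path, description = TRUE)

# Inspect the imported object
class(toy_pathwayCollection)
names(toy_pathwayCollection)
```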

We now carefully discuss the form of this information. This cp_pathwayCollection object has class pathwayCollection and contains the following components:

  1. pathways: A list of character vectors. Each character vector should contain a subset of the names of the -Omes measured in your assay data frame. These pathways should not be too short, otherwise we devolve the problem into simply testing individual genes. The pathwayPCA package requires each pathway to have a minimum of three genes recorded in the assay data frame.

Important: some protein set lists have protein markers recorded as character numerics (e.g. “3”), so make sure the feature names of your assay overlap with the gene or protein names in the pathwayCollection list. Not every gene in your assay data frame will be in the pathways list, and not every gene in each pathway will have a corresponding measurement in the assay data frame. However, for meaningful results, there should be substantial overlap between the genes measured in the assay data frame and the gene names stored in the pathways list. If your pathways list has very few genes matching your assay, then your pathway-based analysis results will be significantly degraded.

  2. TERMS: A character vector comprised of the proper name of each pathway in the pathway collection.
  3. description: (OPTIONAL) A character vector the same length as the pathways list with descriptive information. For instance, the .gmt file included with this package has hyperlinks to the MSigDB description card for that pathway in this field. This field will be imported by the read_gmt function when description = TRUE (it defaults to FALSE).
  4. setsize: the number of genes originally recorded in each pathway, stored as an integer vector. NOTE: this information is calculated and added to the pathways list at Omics-class object creation (later in the workflow). This information is useful to measure the ratio of the number of genes from each pathway recorded in your assay to the number of genes defined to be in that pathway. For each pathway, this ratio should be at least 0.5 for best pathway analysis results.
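The overlap between a pathways list and an assay can be checked by hand before any analysis. The base-R sketch below is an illustrative check we wrote for this purpose, not a pathwayPCA function; the pathways and assay feature names are made up.

```r
# Hypothetical pathways list and assay feature names
toy_pathways_ls <- list(
  pathwayA = c("SOAT1", "LSS", "SQLE", "EBP"),
  pathwayB = c("CYP51A1", "DHCR7", "GENE_NOT_MEASURED_1",
               "GENE_NOT_MEASURED_2", "GENE_NOT_MEASURED_3")
)
assay_genes <- c("SOAT1", "LSS", "SQLE", "EBP", "CYP51A1", "DHCR7")

# Number of genes originally recorded in each pathway
setsize <- lengths(toy_pathways_ls)

# Number of genes in each pathway that were measured in the assay
n_measured <- vapply(
  toy_pathways_ls,
  function(genes) sum(genes %in% assay_genes),
  FUN.VALUE = integer(1)
)

# Ratio of measured genes to pathway size; flag pathways below 0.5
overlap_ratio <- n_measured / setsize
overlap_ratio
names(overlap_ratio)[overlap_ratio < 0.5]
```

In this toy example, pathwayA is fully measured, while pathwayB has only two of its five genes in the assay and is flagged.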

The object itself has the following structure:

This object will be the list supplied to the pathwayCollection_ls argument in the CreateOmics function.

2.3 Creating Your Own pathwayCollection List

Additionally, you can create a pathwayCollection object from scratch with the CreatePathwayCollection function. This may be useful to users who have their pathway information stored in some form other than a .gmt file. You must supply a list of vectors of gene names to the pathways argument, and a vector of the proper names of each pathway to the TERMS argument. You could also store any other pertinent pathway information by passing a <name> = <value> pair to this function.
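A minimal sketch of this, using made-up gene and pathway names, might look as follows. The list of gene-name vectors is passed by position here, because the name of that first argument may differ between package versions; the website field is a hypothetical example of an extra <name> = <value> pair.

```r
library(pathwayPCA)

# A made-up list of gene-name vectors, one vector per pathway
myPathways_ls <- list(
  pathway1 = c("Gene1", "Gene2", "Gene3"),
  pathway2 = c("Gene3", "Gene4", "Gene5"),
  pathway3 = c("Gene6", "Gene7", "Gene8", "Gene9")
)

# The proper name of each pathway, in the same order as the list above
myPathway_names <- c(
  "KEGG_IMPORTANT_PATHWAY_1",
  "KEGG_IMPORTANT_PATHWAY_2",
  "SOME_OTHER_PATHWAY"
)

# Extra <name> = <value> pairs (here a placeholder "website" field)
# are stored alongside the pathways and their names
myPathwayCollection <- CreatePathwayCollection(
  myPathways_ls,
  TERMS = myPathway_names,
  website = "URL_TO_PATHWAY_CITATION"
)
myPathwayCollection
```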

2.4 Importing a Pathway Collection from Wikipathways

To download a .gmt file from Wikipathways, we recommend the R package rWikiPathways. From their vignette:

WikiPathways also provides a monthly data release archive. The archive includes GPML, GMT and SVG collections by organism and timestamped. There’s an R function for grabbing files from the archive…


Calling downloadPathwayArchive() with no arguments will simply open the archive in your default browser so you can look around (in case you don’t know what you are looking for). By default, it opens to the latest collection of GPML files. However, if you provide an organism, then it will download that file to your current working directory or specified destpath. For example, here’s how you’d get the latest GMT file for mouse:

downloadPathwayArchive(organism = "Mus musculus", format = "gmt")

You might also want to specify an archive date so that you can easily share and reproduce your script at any time in the future and get the same result. Remember, new pathways are being added to WikiPathways every month and existing pathways are improved continuously!

downloadPathwayArchive(date = "20171010", organism = "Mus musculus", format = "gmt")

2.5 Writing a pathwayCollection Object to a .gmt File

Finally, we can save the pathwayCollection object we just created via the write_gmt() function:
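As a hedged sketch, the code below builds a small made-up collection and writes it to a temporary file. We assume here that write_gmt takes the pathwayCollection object first and the output file path second; both are passed by position, and all pathway and gene names are hypothetical.

```r
library(pathwayPCA)

# A small, made-up pathway collection
toyCollection <- CreatePathwayCollection(
  list(
    pathway1 = c("Gene1", "Gene2", "Gene3"),
    pathway2 = c("Gene4", "Gene5", "Gene6")
  ),
  TERMS = c("TOY_PATHWAY_1", "TOY_PATHWAY_2")
)

# Write the collection to a temporary .gmt file, then look at the raw lines
out_path <- tempfile(fileext = ".gmt")
write_gmt(toyCollection, out_path)
readLines(out_path)
```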

3. Import and Tidy an Assay Matrix

We assume that the assay data (e.g. transcriptomic data) is either in an Excel file or flat text file. For example, your data may look like this:

In this data set, the columns are individual samples. The values in each row are the -Omic expression measurements for the gene in that row.

3.1 Import with readr

To import data files in .csv (comma-separated), .fwf (fixed-width), or .txt (tab-delimited) format, we recommend the readr package. You can import .csv files with the read_csv function, fixed-width files with read_fwf, and general delimited files with read_delim. Additionally, for data in .xls or .xlsx format, we recommend the readxl package. We read our example .csv assay file with read_csv.
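The sketch below shows the import step on a tiny, made-up assay file written to a temporary location. The values and file are illustrative only; as in the layout described above, genes are in rows, samples are in columns, and the first (gene name) column has no header.

```r
library(readr)

# A tiny made-up assay: genes in rows, samples in columns,
# with no name for the first (gene name) column
toy_csv_path <- tempfile(fileext = ".csv")
writeLines(
  c(
    ",T21101311,T21101312,T21101313",
    "SOAT1,5.37,5.52,5.89",
    "LSS,9.77,9.78,8.11",
    "SQLE,7.74,8.06,7.00"
  ),
  con = toy_csv_path
)

# Import the file; read_csv guesses the column types and reports them
toy_assay_df <- read_csv(toy_csv_path)
toy_assay_df
```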

The read_csv function notices that the name of the first column is missing and fills it in automatically (as X1 in readr versions before 2.0, and as ...1 from readr 2.0 onward). Further, this function prints a message to the screen informing you of the assumptions it makes when importing your data. Here, the message tells us that all the imported data is numeric (double) except for the gene name column, which is character.

Let’s inspect our assay data frame. Note that the gene names were imported as a character column, as shown by the <chr> tag at the top of the first column. This data import step stored the row names (the gene names) as the first column, and preserved the column names (sample labels) of the data.

3.2 Tidy the Assay Data Frame

The assay input to the pathwayPCA package must be in tidy data format. The “Tidy Data” format requires that each observation be its own row, and each measurement its own column. This means that we must transpose our assay data frame, while preserving the row and column names.

To do this, we can use the TransposeAssay function. This function takes in a data frame as imported by the three readr functions based on data in a format similar to that shown above: genes are the rows, gene names are the first column, samples are stored in the subsequent columns, and all values in the assay (other than the gene names in the first column) are numeric.
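As an illustration, here is a hedged sketch that applies TransposeAssay with its default settings to a tiny made-up assay in the layout just described. The data values and object names are ours.

```r
library(tibble)
library(pathwayPCA)

# A tiny made-up assay: gene names in the first column, one column per sample
toy_assay_df <- tibble(
  Gene      = c("SOAT1", "LSS", "SQLE"),
  T21101311 = c(5.37, 9.77, 7.74),
  T21101312 = c(5.52, 9.78, 8.06),
  T21101313 = c(5.89, 8.11, 7.00)
)

# Transpose: samples become rows, genes become columns
toy_assayT_df <- TransposeAssay(toy_assay_df)
toy_assayT_df
```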

This transposed data frame has the gene names as the column names and the sample names as a column of character (chr) values. Notice that the data itself is 17 genes measured on 36 samples. Before transposition, we had 37 columns because the feature names were stored in the first column. After transposition, we have 36 rows but 18 columns: the first column stores the sample names. This transposed data frame (after filtering to match the response data) will be supplied to the assayData_df argument in the CreateOmics function. (See the Creating Omics Data Objects vignette for more information on creating Omics-class objects.)

3.3 Subsetting a Tidy Data Frame

If ever we need to extract individual components of a tidy data frame, we can use the assay[row, col] syntax. If we need entire measurements (columns), then we can call the column by name with the assay$ColName syntax. For example,

  - If we need the row of assayT_df corresponding to Sample “T21101312”, then we select that row and keep all columns. The resulting tibble object has 1 row and 18 columns.
  - If we need the third column of assayT_df (corresponding to Gene “LSS”), then we select that column and keep all rows. The resulting tibble object has 36 rows and 1 column.
  - If we need the intersection of these two (the expression level of Gene “LSS” in Sample “T21101312”), then we select both the row and the column. This output would normally be a 1 by 1 tibble (which isn’t terribly helpful), so we add the drop = TRUE argument to “drop” the dimensions of the table. This gives us a single basic number (scalar).
  - If we need the third column of assayT_df, but we want the result back as a vector instead of a tibble, then we call the column by name.
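These four subsetting patterns can be sketched on a small made-up tibble with the same layout as the transposed assay (sample names first, then one column per gene). The object name and values are illustrative.

```r
library(tibble)

# A small stand-in for the transposed assay
toy_assayT_df <- tibble(
  Sample = c("T21101311", "T21101312", "T21101313"),
  SOAT1  = c(5.37, 5.52, 5.89),
  LSS    = c(9.77, 9.78, 8.11),
  SQLE   = c(7.74, 8.06, 7.00)
)

# One row, all columns: a 1-row tibble
toy_assayT_df[2, ]

# One column, all rows: a 1-column tibble (LSS is the third column here)
toy_assayT_df[, 3]

# One row and one column, with dimensions dropped to give a scalar
toy_assayT_df[2, 3, drop = TRUE]

# A column called by name: a plain numeric vector
toy_assayT_df$LSS
```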

3.4 Data from a SummarizedExperiment Object

Oftentimes, genomic experiment data is stored in a SummarizedExperiment-class object. If your assay and response data are stored in such an object, use the SE2Tidy() function to extract the necessary information and return it as a tidy data frame. Because SummarizedExperiment objects can have more than one assay, you must specify the index for the assay of your choice with the whichAssay argument. Here is an example using the airway data:
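A hedged sketch of this step follows. It assumes that the Bioconductor airway data package is installed and that the first assay is the one of interest; the object name airway_df is ours.

```r
library(SummarizedExperiment)
library(pathwayPCA)

# Assumes the Bioconductor "airway" data package is installed
data("airway", package = "airway")

# Extract the first assay and the sample-level data as one tidy data frame
airway_df <- SE2Tidy(airway, whichAssay = 1)

# Look at the first few rows and columns
airway_df[1:5, 1:10]
```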

Now we can look at a nice summary of the tidied assay and response data. Note that SE2Tidy drops all of the gene-specific metadata, as well as any experiment metadata. However, pathwayPCA can’t make use of this metadata anyway, so we haven’t lost much.

4. Import and Join Response Data

We now have an appropriate pathways list and a tidy -Omics assay data frame. All we need now is some response data. Let’s imagine that your phenotype data looks something like this:

We next import this response information. We can use the read_csv function once again:

pInfo_path <- system.file("extdata", "ex_pInfo_subset.csv",
                          package = "pathwayPCA", mustWork = TRUE)
pInfo_df <- read_csv(pInfo_path)
#> Rows: 36 Columns: 3
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (1): Sample
#> dbl (1): eventTime
#> lgl (1): eventObserved
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

This phenotype data frame has a column for the sample labels (Sample) and the response information. In this case, our response is a survival response with an event time and observation indicator.

pInfo_df
#> # A tibble: 36 × 3
#>    Sample    eventTime eventObserved
#>    <chr>         <dbl> <lgl>        
#>  1 T21101311     14.2  TRUE         
#>  2 T21101312      1    TRUE         
#>  3 T21101313      6.75 FALSE        
#>  4 T21101314      8.5  TRUE         
#>  5 T21101315      7.25 FALSE        
#>  6 T21101316      5    TRUE         
#>  7 T21101317     20    TRUE         
#>  8 T21101318     13.2  FALSE        
#>  9 T21101319      7.75 FALSE        
#> 10 T21101320      9    FALSE        
#> # … with 26 more rows

This pInfo data frame has the sample names as a column of character values, just like the transposed assay data frame. This is crucially important for the “joining” step. We can use the inner_join function from the dplyr package to retain only the rows of the assayT_df data frame which have responses in the pInfo data frame and vice versa. This way, every response in the phenotype data has a matching row of gene expression measurements in the assay, and every sample in the assay has a matching response in the phenotype data.

joinedExperiment_df <- inner_join(pInfo_df, assayT_df, by = "Sample")
joinedExperiment_df
#> # A tibble: 36 × 20
#>    Sample  eventTime eventObserved SOAT1   LSS  SQLE   EBP CYP51A1 DHCR7 CYP27B1
#>    <chr>       <dbl> <lgl>         <dbl> <dbl> <dbl> <dbl>   <dbl> <dbl>   <dbl>
#>  1 T21101…     14.2  TRUE           5.37  9.77  7.74  4.68    8.27  8.32    6.78
#>  2 T21101…      1    TRUE           5.52  9.78  8.06  5.12    8.21  8.33    6.47
#>  3 T21101…      6.75 FALSE          5.89  8.11  7.00  5.78    8.20  8.39    6.57
#>  4 T21101…      8.5  TRUE           5.62  8.67  8.59  5.64    8.07  8.64    6.47
#>  5 T21101…      7.25 FALSE          5.49  9.83  8.13  5.73    9.38  8.15    6.43
#>  6 T21101…      5    TRUE           5.58  9.85  8.55  5.13    9.40  8.71    6.56
#>  7 T21101…     20    TRUE           5.32 10.0   6.99  5.86    8.08  9.25    6.86
#>  8 T21101…     13.2  FALSE          5.49  9.72  7.47  5.16    6.67  7.37    6.70
#>  9 T21101…      7.75 FALSE          5.57  9.88  7.97  5.40    7.91  8.06    6.58
#> 10 T21101…      9    FALSE          5.16  9.87  7.42  5.50    7.43  8.68    6.55
#> # … with 26 more rows, and 10 more variables: DHCR24 <dbl>, HSD17B7 <dbl>,
#> #   MSMO1 <dbl>, FDFT1 <dbl>, SC5DL <dbl>, LIPA <dbl>, CEL <dbl>, TM7SF2 <dbl>,
#> #   NSDHL <dbl>, SOAT2 <dbl>

This requires you to have a key column in both data frames with the same name. If the key column was called “Sample” in the pInfo_df data set but “SampleID” in the assay, then the by argument should be changed to by = c("Sample" = "SampleID"). It’s much nicer to just keep them with the same names, however. Moreover, it is vitally important that you check your sample IDs. The recorded genetic data should pair with the phenotype information, but it is your responsibility as the user to confirm that the assay rows match the correct responses. You are ultimately responsible for defending the integrity of your data and for using this package properly.
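The sketch below illustrates both points on small made-up data frames: joining when the key columns have different names, and checking for sample IDs that appear in only one of the two data sets. The anti_join check is one possible way to inspect mismatches, not a step required by pathwayPCA.

```r
library(dplyr)
library(tibble)

# Made-up phenotype data, keyed by "Sample"
toy_pInfo_df <- tibble(
  Sample        = c("T21101311", "T21101312", "T21101313"),
  eventTime     = c(14.2, 1, 6.75),
  eventObserved = c(TRUE, TRUE, FALSE)
)

# Made-up transposed assay, keyed by "SampleID"
toy_assayT_df <- tibble(
  SampleID = c("T21101312", "T21101313", "T21101314"),
  SOAT1    = c(5.52, 5.89, 5.62),
  LSS      = c(9.78, 8.11, 8.67)
)

# Samples present in only one of the two data frames
anti_join(toy_pInfo_df, toy_assayT_df, by = c("Sample" = "SampleID"))
anti_join(toy_assayT_df, toy_pInfo_df, by = c("SampleID" = "Sample"))

# Keep only the samples present in both
toy_joined_df <- inner_join(
  toy_pInfo_df, toy_assayT_df,
  by = c("Sample" = "SampleID")
)
toy_joined_df
```

Here one sample is missing from each data frame, so the joined result keeps only the two samples they share.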

5. Example Tidy Assay and Pathways List

Included in this package, we have a small tidy assay and corresponding gene subset list. We will load and inspect this assay. This data set has 656 gene expression measurements on 250 colon cancer patients. Further notice that the assay and overall survival response information have already been matched.

#> # A tibble: 250 × 659
#>    sampleID OS_time OS_event   JUN  SOS2  PAK3  RAF1 PRKCB   BTC  SHC1 PRKCA
#>    <chr>      <dbl>    <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 subj1     64.9          0  9.29  5.48  8.21  8.03  5.49  6.65  8.26  8.94
#>  2 subj2     59.8          0  9.13  6.35  8.33  7.94  6.26  7.02  8.39  9.61
#>  3 subj3     62.4          0  9.37  5.67  7.82  7.74  6.05  7.52  8.69  8.40
#>  4 subj4     54.5          0 10.6   4.94  8.79  7.64  5.37  6.87  7.81  9.80
#>  5 subj5     46.3          1  8.70  5.60  8.75  8.05  6.07  6.49  8.45  8.21
#>  6 subj6     55.9          0  9.78  5.36  7.56  8.07  5.90  6.39  8.87  8.22
#>  7 subj7     58.0          0  9.22  5.05  8.20  7.80  5.55  6.86  8.28  8.97
#>  8 subj8     54.0          0 10.3   5.33  7.82  7.89  6.27  6.25  8.66  9.71
#>  9 subj9      0.427        1 10.8   5.07  7.63  7.69  5.48  7.57  8.36  9.69
#> 10 subj10    41.4          0  9.52  5.50  7.48  7.53  5.71  7.33  8.54  8.14
#> # … with 240 more rows, and 648 more variables: ELK1 <dbl>, NRG1 <dbl>,
#> #   PAK2 <dbl>, MTOR <dbl>, PAK4 <dbl>, MAP2K4 <dbl>, EIF4EBP1 <dbl>,
#> #   BAD <dbl>, PRKCG <dbl>, NRG3 <dbl>, …

We also have a small list of 15 pathways which correspond to our example colon cancer assay. To create a toy example, we have curated this artificial pathways list to include seven significant pathways and eight non-significant pathways.

#> Object with Class(es) 'pathwayCollection', 'list' [package 'pathwayPCA'] with 2 elements: 
#>  $ pathways:List of 15

The pathways list and tidy assay (with matched phenotype information) are all the information we need to create an Omics-class data object.
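For reference, a hedged sketch of loading the two example objects printed above is given here. The data set names colonSurv_df and colon_pathwayCollection are not stated in the text of this chapter, so treat them as our assumption; running data(package = "pathwayPCA") lists the data sets actually shipped with your installed version.

```r
library(pathwayPCA)

# Load the example tidy assay (with matched survival response) and the
# example pathway collection; the object names are assumed, see above
data("colonSurv_df")
data("colon_pathwayCollection")

# Inspect both objects
colonSurv_df
colon_pathwayCollection
```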

6. Review

We now summarize our steps so far. We have

  1. Imported a .gmt file and saved the pathways stored therein as a pathwayCollection object using the read_gmt function.
  2. Imported an assay .csv file with the read_csv function from the readr package, and transposed this data frame into “tidy” form with the TransposeAssay function.
  3. Imported a phenotype information .csv file, and joined it to the assay data frame with the inner_join function from the dplyr package.

Now we are prepared to create our first Omics-class object for analysis with either AES-PCA or Supervised PCA. Please read vignette chapter 3: Creating Omics Data Objects.

Here is the R session information for this vignette:

#> R version 4.2.0 RC (2022-04-19 r82224)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.4 LTS
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.15-bioc/R/lib/
#> LAPACK: /home/biocbuild/bbs-3.15-bioc/R/lib/
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> attached base packages:
#> [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
#> [8] methods   base     
#> other attached packages:
#>  [1] SummarizedExperiment_1.26.0 Biobase_2.56.0             
#>  [3] GenomicRanges_1.48.0        GenomeInfoDb_1.32.0        
#>  [5] IRanges_2.30.0              S4Vectors_0.34.0           
#>  [7] BiocGenerics_0.42.0         MatrixGenerics_1.8.0       
#>  [9] matrixStats_0.62.0          survminer_0.4.9            
#> [11] ggpubr_0.4.0                survival_3.3-1             
#> [13] pathwayPCA_1.12.0           forcats_0.5.1              
#> [15] stringr_1.4.0               dplyr_1.0.8                
#> [17] purrr_0.3.4                 readr_2.1.2                
#> [19] tidyr_1.2.0                 tibble_3.1.6               
#> [21] ggplot2_3.3.5               tidyverse_1.3.1            
#> loaded via a namespace (and not attached):
#>  [1] nlme_3.1-157           bitops_1.0-7           fs_1.5.2              
#>  [4] lubridate_1.8.0        bit64_4.0.5            httr_1.4.2            
#>  [7] tools_4.2.0            backports_1.4.1        bslib_0.3.1           
#> [10] utf8_1.2.2             R6_2.5.1               DBI_1.1.2             
#> [13] mgcv_1.8-40            colorspace_2.0-3       withr_2.5.0           
#> [16] tidyselect_1.1.2       gridExtra_2.3          bit_4.0.4             
#> [19] compiler_4.2.0         cli_3.3.0              rvest_1.0.2           
#> [22] xml2_1.3.3             DelayedArray_0.22.0    labeling_0.4.2        
#> [25] sass_0.4.1             scales_1.2.0           survMisc_0.5.6        
#> [28] digest_0.6.29          rmarkdown_2.14         XVector_0.36.0        
#> [31] pkgconfig_2.0.3        htmltools_0.5.2        dbplyr_2.1.1          
#> [34] fastmap_1.1.0          highr_0.9              rlang_1.0.2           
#> [37] readxl_1.4.0           rstudioapi_0.13        jquerylib_0.1.4       
#> [40] farver_2.1.0           generics_0.1.2         zoo_1.8-10            
#> [43] jsonlite_1.8.0         vroom_1.5.7            car_3.0-12            
#> [46] RCurl_1.98-1.6         magrittr_2.0.3         GenomeInfoDbData_1.2.8
#> [49] lars_1.3               Matrix_1.4-1           munsell_0.5.0         
#> [52] fansi_1.0.3            abind_1.4-5            lifecycle_1.0.1       
#> [55] stringi_1.7.6          yaml_2.3.5             carData_3.0-5         
#> [58] zlibbioc_1.42.0        grid_4.2.0             crayon_1.5.1          
#> [61] lattice_0.20-45        haven_2.5.0            splines_4.2.0         
#> [64] hms_1.1.1              knitr_1.38             pillar_1.7.0          
#> [67] ggsignif_0.6.3         reprex_2.0.1           glue_1.6.2            
#> [70] evaluate_0.15          data.table_1.14.2      modelr_0.1.8          
#> [73] vctrs_0.4.1            tzdb_0.3.0             cellranger_1.1.0      
#> [76] gtable_0.3.0           km.ci_0.5-6            assertthat_0.2.1      
#> [79] xfun_0.30              xtable_1.8-4           broom_0.8.0           
#> [82] rstatix_0.7.0          KMsurv_0.1-5           ellipsis_0.3.2