rfaRm
Lara Selles Vidal, Rafael Ayala, Guy-Bart Stan, Rodrigo Ledesma-Amaro
April 7, 2020
The Rfam database is a collection of families of non-coding RNA and other structured RNA elements. Each family is defined by a multiple sequence alignment of family members, a consensus secondary structure and a covariance model, which integrates both multiple sequence alignment information and secondary structure information, and are related to the hidden Markov models employed by Pfam. “rfaRm” is a R package providing a client-side interface to the Rfam database. The package allows the search of the Rfam database by keywords or sequences, as well as the retrieval of all available information on specific Rfam families, such as member sequences, multiple sequence alignments, secondary structures and covariance models.
The Rfam database (Kalvari et al. 2017) is a collection of RNA sequences classified into different families of non-coding RNA, cis-regulatory elements and other structured RNA. It is designed to be analogous to the Pfam database of protein families (El-Gebali et al. 2018). The initial step for identifying an Rfam family is to generate a manually curated seed alignment of a set of known RNA sequences, ideally including secondary structure information. The Infernal software package (Nawrocki & Eddy 2013) is then used to build a covariance model for the family (a probabilistic profile of the sequence and secondary structure of the family), and to search for new homologs to add from a sequence database. When new homologs are added, the covariance model is updated, and the process is repeated until no new homologs are found.
Much of the information stored in Rfam could be useful if integrated within existing genomics and transcriptomics workflows and with functionalities of packages already available in BioConductor. rfaRm aims to facilitate the integration of the information provided by Rfam in existing workflows by providing an R client-side interface to the Rfam database.
rfaRm allows to search the Rfam database by keywords and sequence. In the search by keywords, the supplied keywodrs are matched against the fields of each entry, and accessions for the matching families are returned. In the search by sequence, the supplied sequence is used by Infernal to be searched against the Rfam library of covariance models, returning high-scoring hits which allow the identification of regions in the query sequence that belong to an Rfam family. Searches can be used to find RNA families of interest.
rfaRm also allows the retrieval of the information available in Rfam for each RNA family by providing a valid Rfam family identifier. The types of retrievable information for each Rfam family include:
The results of each query are returned as R objects that can be used to perform further analysis on the retrieved data with other R packages. Additionally, when appropriate they can also be saved to standard file formats, which can be read by commonly used Bioinformatics software.
Rfam families can be searched by keywords with the “rfamTextSearchFamilyAccession” function. The function returns a character vector where each element is a string representing the Rfam accession for an Rfam family that matched the searched keyword. Rfam accessions are always of the “RFXXXXX” structure, where X is any digit from 0 to 9. For example, in order to search for Rfam families corresponding to riboswitches, the keyword “riboswitch” can be passed as the query to the search function. The function will output a vector with the Rfam accessions of the matching families:
rfamTextSearchFamilyAccession("riboswitch")
## [1] "RF00174" "RF00059" "RF00162" "RF00504" "RF00379" "RF00050" "RF01051"
## [8] "RF00167" "RF01750" "RF00168" "RF01734" "RF00634" "RF01068" "RF00442"
## [15] "RF01057"
It is also possible to perform a sequence-based search with the “rfamSequenceSearch” function. The supplied query sequence is matched against the covariance models of the Rfam families by the Infernal software in order to identify regions in the query sequence that belong to one of the Rfam families. It should be noted that sequence-based searches can occasionally take long times to finish, sometimes up to several hours. The supplied string should be an RNA sequence, containing only the standard RNA nucleotides (A, U, G and C). The function will return a nested list where each top-level element represents a high-scoring hit, and is in itself a list including information such as the Rfam family with which the hit was found, and an alignment between the query sequence and the consensus sequence of the Rfam family. In the following example, only one high-scoring hit is identified in the provided sequence: an FMN riboswitch. The alignment with the consensus sequence of the Rfam family is saved to a text file, together with its associated secondary structure annotation.
sequenceSearch <- rfamSequenceSearch("GGAUCUUCGGGGCAGGGUGAAAUUCCCGACCGGUGGUAUAGUCCACGAAAGUAUUUGCUUUGAUUUGGUGAAAUUCCAAAACCGACAGUAGAGUCUGGAUGAGAGAAGAUUC")
length(sequenceSearch)
## [1] 1
names(sequenceSearch)
## [1] "FMN"
## Save the aligned consensus sequence and query sequence to a text file,
## together with secondary structure annotation
writeLines(c(sequenceSearch$FMN$alignmentQuerySequence, sequenceSearch$FMN$alignmentMatch,
sequenceSearch$FMN$alignmentHitSequence, sequenceSearch$FMN$alignmentSecondaryStructure),
con = "searchAlignment.txt")
Each Rfam family has two main identifiers: its accession (a string with the RFXXXX format), and its ID (a keyword related to the Rfam family functionality or RNA type). Both can be used interchangeably with all of the query functions available in rfaRm. It is possible to obtain the ID corresponding to an accession with the “rfamFamilyAccessionToID” function, and the accession corresponding to an ID with the “rfamFamilyIDToAccession” function.
rfamFamilyAccessionToID("RF00174")
## [1] "Cobalamin"
rfamFamilyIDToAccession("FMN")
## [1] "RF00050"
When the accession or ID of the Rfam family of interest is known, it is possible to query the database to retrieve different types of data about the family.
A summary with a short functional description of an Rfam family can be obtained with the “rfamFamilySummary” function.
rfamFamilySummary("RF00174")
## $rfamReleaseNumber
## [1] "15.00"
##
## $numberSequencesSeedAlignment
## [1] "434"
##
## $sourceSeedAlignment
## [1] "Barrick JE, Breaker RR"
##
## $numberSpecies
## [1] "7605"
##
## $RNAType
## [1] "Cis-reg; riboswitch;"
##
## $numberSequencesAll
## [1] "20864"
##
## $structureSource
## [1] "Predicted; PFOLD"
##
## $description
## [1] "Cobalamin riboswitch aptamer"
##
## $rfamAccession
## [1] "RF00174"
##
## $rfamID
## [1] "Cobalamin"
##
## $comment
## [1] "Riboswitches are metabolite binding domains within certain messenger RNAs that serve as precision sensors for their corresponding targets. Allosteric rearrangement of mRNA structure is mediated by ligand binding, and this results in modulation of gene expression [1]. This family represents a cobalamin riboswitch (B12-element) which is widely distributed in 5' UTRs of vitamin B(12)-related genes in bacteria. Cobalamin in the form of adenosylcobalamin (Ado-CBL) is known to repress expression of genes for vitamin B(12) biosynthesis and be transported by a post-transcriptional regulatory mechanism, which involves direct binding of Ado-CBL to 5' UTRs [2]. Cobalamin riboswitch structure was reported in PDB:4GMA [5], 4GXY [6] and 6VMY [7]. Cobalamin riboswitch forms a junctional structure that form a pocket that recognise adenosylcobalamin and is flanked by structural elements that connects by long-range interactions."
# The summary includes, amongst other data, a brief paragraph describing the
# family
rfamFamilySummary("RF00174")$comment
## [1] "Riboswitches are metabolite binding domains within certain messenger RNAs that serve as precision sensors for their corresponding targets. Allosteric rearrangement of mRNA structure is mediated by ligand binding, and this results in modulation of gene expression [1]. This family represents a cobalamin riboswitch (B12-element) which is widely distributed in 5' UTRs of vitamin B(12)-related genes in bacteria. Cobalamin in the form of adenosylcobalamin (Ado-CBL) is known to repress expression of genes for vitamin B(12) biosynthesis and be transported by a post-transcriptional regulatory mechanism, which involves direct binding of Ado-CBL to 5' UTRs [2]. Cobalamin riboswitch structure was reported in PDB:4GMA [5], 4GXY [6] and 6VMY [7]. Cobalamin riboswitch forms a junctional structure that form a pocket that recognise adenosylcobalamin and is flanked by structural elements that connects by long-range interactions."
The consensus secondary structure for an Rfam family can be retrieved, together with its consensus sequence, with the “rfamConsensusSecondaryStructure” function. RNA Secondary structure notation can be output in the WUSS or extended Dot-Bracket notations. In the example below, the consensus secondary structure of the Rfam family with accession RF00174 is retrieved in both notations, and saved to a file with the extended Dot-Bracket notation.
rfamConsensusSecondaryStructure("RF00174", format = "WUSS")
## [1] "uauaauuauuccgccg.c.c.G.GU....ccccccu......................aaaga.gggggguu....AAaa.GGGAA.c.cc.GG.UGc...............................aAauCCgg.gGCuGCCC.CC.GCaACUGUAA..ccgc.......gaacga.acccccaauau.............................gCCACUGgcccguaa..........................................................gggcCGGGAAGGc...gg.g..g...ga.aagcgauGAc.................................................................ccgC.gAGCCAGGAGACCuGCCg.gcg.gaggaaguaucac......................."
## [2] "::::::::[[[[-[[[.[.[.[.[[....<<<<<<_......................_____.>>>>>><_....__>(.((,,,.<.<<.<<.___...............................____>>>>.>AA<<___.__._>><<<<---..<<<<.......---<--.-<<-<<-----.............................<<<-<<<<<<_____.........................................................._>>>>>>--->>>...>>.>..>...>-.-->-------.................................................................>>>>.--aa>>->,,,)))]]]].]]].]-]]]]:::::::......................."
rfamConsensusSecondaryStructure("RF00174", format = "DB")
## [1] "uauaauuauuccgccg.c.c.G.GU....ccccccu......................aaaga.gggggguu....AAaa.GGGAA.c.cc.GG.UGc...............................aAauCCgg.gGCuGCCC.CC.GCaACUGUAA..ccgc.......gaacga.acccccaauau.............................gCCACUGgcccguaa..........................................................gggcCGGGAAGGc...gg.g..g...ga.aagcgauGAc.................................................................ccgC.gAGCCAGGAGACCuGCCg.gcg.gaggaaguaucac......................."
## [2] "........[[[[.[[[.[.[.[.[[....<<<<<<.............................>>>>>><.......>(.((....<.<<.<<.......................................>>>>.>AA<<........>><<<<.....<<<<..........<....<<.<<..................................<<<.<<<<<<................................................................>>>>>>...>>>...>>.>..>...>....>........................................................................>>>>...aa>>.>...)))]]]].]]].].]]]].............................."
rfamConsensusSecondaryStructure("RF00174", filename = "RF00174secondaryStructure.txt",
format = "DB")
## [1] "uauaauuauuccgccg.c.c.G.GU....ccccccu......................aaaga.gggggguu....AAaa.GGGAA.c.cc.GG.UGc...............................aAauCCgg.gGCuGCCC.CC.GCaACUGUAA..ccgc.......gaacga.acccccaauau.............................gCCACUGgcccguaa..........................................................gggcCGGGAAGGc...gg.g..g...ga.aagcgauGAc.................................................................ccgC.gAGCCAGGAGACCuGCCg.gcg.gaggaaguaucac......................."
## [2] "........[[[[.[[[.[.[.[.[[....<<<<<<.............................>>>>>><.......>(.((....<.<<.<<.......................................>>>>.>AA<<........>><<<<.....<<<<..........<....<<.<<..................................<<<.<<<<<<................................................................>>>>>>...>>>...>>.>..>...>....>........................................................................>>>>...aa>>.>...)))]]]].]]].].]]]].............................."
The generated files with secondary structure in the extended Dot-Bracket notation can be read by the R4RNA package (Lai et al. 2012), and used for further analysis and plotting of the corresponding RNA.
secondaryStructureTable <- readVienna("RF00174secondaryStructure.txt")
plotHelix(secondaryStructureTable)