1 Introduction

Predicting protein-protein interaction (PPI) networks is an important way to gain knowledge about protein interactions in model organisms for which only a small amount of PPI information is available. Current PPI databases that provide predicted interaction data lack many organisms or contain poorly reproducible information about the predicted interactions. Currently available prediction approaches are mainly based on biological data (functional annotation, co-expression etc.), which are often not available for “less established” organisms, for example, those for which only sequence data is available. In addition, it is often of major interest to learn about a particular pathway in such a “less-studied” organism. To overcome these drawbacks, Path2PPI can be used to predict the proteins and interactions of a pathway of interest in a target organism by using and combining the PPIs of other, well-established model organisms.

To do so, it needs a list of proteins of interest from each reference species and the result files produced by the local NCBI BLAST (Camacho et al., 2009) tool (see section 2.2). The relevant interactions based on the user's protein lists are automatically extracted from the corresponding iRefIndex files (Razick et al., 2008).

2 Preparation of the data

In this tutorial, we make use of the test data set provided with the package. This data set consists of all data files necessary to predict the interactions of the induction step of autophagy in Podospora anserina by means of the corresponding PPIs in human and yeast. Hence, we first load the “autophagy induction” test data set:

data(ai) #Load test data set
ls() #"ai" contains six data objects
## [1] "human.ai.irefindex"   "human.ai.proteins"    "pa2human.ai.homologs"
## [4] "pa2yeast.ai.homologs" "yeast.ai.irefindex"   "yeast.ai.proteins"

As shown by ls(), the test data set contains six data objects (three for each of the two reference species, human and yeast). First, the algorithm requires a list of proteins that define the corresponding pathway for each reference species, given in “human.ai.proteins” and “yeast.ai.proteins” (see section 2.1). Second, the algorithm requires the data frames that contain the interactions of each reference species, given in “human.ai.irefindex” and “yeast.ai.irefindex” (described in more detail in section 2.1). Third, the algorithm needs to know the homologous relations between the target species and each reference species. These relations are defined in the data frames “pa2human.ai.homologs” and “pa2yeast.ai.homologs” (described in more detail in section 2.2).

If you want to use Path2PPI for your own purposes, you have to generate and prepare the necessary data files.

2.1 Proteins and interactions of pathways of interest

We list the proteins associated with a specific pathway of interest in a character vector for each reference species. As an example of such lists, we take a brief look at the loaded data set. Among others, it contains the two named character vectors “human.ai.proteins” and “yeast.ai.proteins”, which hold the corresponding proteins for human and yeast, our two reference species:

human.ai.proteins
##  P42345  O75385  Q8IYT8  Q6PHR2  O75143 
##  "MTOR"  "ULK1"  "ULK2"  "ULK3" "ATG13"
yeast.ai.proteins
##  P35169  P32600  P53104  Q06410  Q12527  Q06628  P39968 
##  "TOR1"  "TOR2"  "ATG1" "ATG17" "ATG11" "ATG13"  "VAC8"

In this example, the values are the trivial names of the proteins and the names are the actual protein identifiers. If the trivial names of the proteins are not available, Path2PPI also accepts simple character vectors where the values are the protein identifiers. For example, this simple character vector, consisting only of the protein identifiers, would also be a valid protein list:

## [1] "P42345" "O75385" "Q8IYT8" "Q6PHR2" "O75143"

The major advantage of using a named character vector with the trivial names is that these names will be shown in the plots, which makes interpretation more comfortable. You can use various accession formats for the protein identifiers that are supported by iRefIndex (e.g. UniProt, SwissProt, Ensembl). However, we strongly recommend using UniProt identifiers, since those are the most established ones.

Use the default R functions to load your own protein lists of interest into R (e.g. read.table).
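As a minimal sketch of this step, the following reads a protein list from a tab-separated file and turns it into a named character vector of the form shown above. The file layout (identifier in the first column, trivial name in the second) and the object names are assumptions made for this illustration; the file is written to a temporary location only so that the example is self-contained.

```r
# Hypothetical input: one protein per line, "<identifier><TAB><trivial name>".
protein.file <- tempfile(fileext = ".txt")
writeLines(c("P42345\tMTOR", "O75385\tULK1", "Q8IYT8\tULK2"), protein.file)

# Read the table and keep the columns as character strings, not factors.
protein.table <- read.table(protein.file, sep = "\t", header = FALSE,
                            stringsAsFactors = FALSE)

# Values are the trivial names, names are the protein identifiers.
my.proteins <- setNames(protein.table$V2, protein.table$V1)
my.proteins
```

If only identifiers are available, protein.table$V1 on its own is already a valid simple character vector.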

These proteins of interest are used to find the relevant interactions in the iRefIndex file of the corresponding species. iRefIndex tables are available for the seven most established model organisms and can be found here: http://irefindex.org/wiki/index.php?title=iRefIndex. You can also use the corresponding iRefR package to retrieve the iRefIndex data frames directly from this page. Unfortunately, the package is not updated as frequently as the web page, so you may not get the latest release of a given file.

The “autophagy induction” test data set provides only a very small part of the iRefIndex files for yeast and human, namely the relevant interactions necessary for this tutorial. The complete files are much larger. The data frames “human.ai.irefindex” and “yeast.ai.irefindex” in the test data set contain these iRefIndex parts. See:

str(human.ai.irefindex)
str(yeast.ai.irefindex)

2.2 Get homology files using NCBI BLAST+

You also need the result files produced by the BLAST+ toolkit, which is available from the NCBI web page: http://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastDocs&DOC_TYPE=Download. The test data set already includes the necessary results of the BLAST searches of the proteome of P. anserina against the proteomes of yeast and human (“pa2human.ai.homologs” and “pa2yeast.ai.homologs”):

head(pa2yeast.ai.homologs)
##          V1     V2    V3  V4  V5 V6  V7  V8   V9  V10   V11   V12
## 534  B2AX00 P53104 25.66 304 165 12  15 268   21  313 1e-15  79.0
## 960  B2AXW7 P53104 28.52 291 154 10 510 751   30  315 7e-24 106.0
## 1278 B2AUR5 P35169 26.74 288 154 10 602 837 2078 2360 2e-11  65.9
## 1279 B2AUR5 P32600 25.09 275 152  9 610 835 2090 2359 1e-09  60.5
## 1555 B2B7B1 P53104 25.99 177  99  7 807 963  149  313 5e-09  59.3
## 2469 B2ASL9 P53104 26.42 352 171 13  20 314   24  344 2e-19  87.8
head(pa2human.ai.homologs)
##          V1     V2    V3  V4  V5 V6  V7  V8 V9 V10   V11   V12
## 2123 B2AX00 Q6PHR2 33.09 269 146 15  11 268 13 258 5e-24 106.0
## 2177 B2AX00 O75385 30.30 231 131  9  24 239 22 237 3e-22 103.0
## 2213 B2AX00 Q8IYT8 29.24 236 142  8  24 247 15 237 4e-21  99.4
## 4588 B2AXW7 Q6PHR2 32.96 267 159 12 499 754  6 263 9e-30 125.0
## 4649 B2AXW7 Q8IYT8 29.17 216 136  6 509 714 14 222 2e-23 107.0
## 4658 B2AXW7 O75385 29.95 217 133  9 509 714 21 229 5e-23 106.0

The second column (V2) contains the protein identifiers of the corresponding reference species to which the protein of the target species (here: P. anserina) in the first column (V1) is homologous. Keep in mind that these protein identifiers must be of the same type as those used in the protein lists described above.
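A quick way to check that the identifiers agree is to test which proteins of the list occur in column V2 of the homology table. The sketch below is only an illustration in base R: it rebuilds the yeast protein list and the first rows of “pa2yeast.ai.homologs” shown above by hand (columns V1 and V2 only), and the object names are chosen for this example.

```r
# Yeast protein list as shown in section 2.1 (names are the identifiers).
reference.proteins <- c(P35169 = "TOR1", P32600 = "TOR2", P53104 = "ATG1",
                        Q06410 = "ATG17", Q12527 = "ATG11", Q06628 = "ATG13",
                        P39968 = "VAC8")

# First two columns of the rows printed by head(pa2yeast.ai.homologs).
homologs <- data.frame(
  V1 = c("B2AX00", "B2AXW7", "B2AUR5", "B2AUR5", "B2B7B1", "B2ASL9"),
  V2 = c("P53104", "P53104", "P35169", "P32600", "P53104", "P53104"),
  stringsAsFactors = FALSE
)

# TRUE for every pathway protein that has at least one hit in these rows.
has.homolog <- names(reference.proteins) %in% homologs$V2
reference.proteins[has.homolog]
```

If no protein of the list is found in V2 of a complete homology file, the two data sets most likely use different identifier types.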

If you are unfamiliar with the BLAST+ toolkit, we refer to the BLAST+ user manual or to the widely available tutorials on the web.

Nevertheless, we want to give a very short description of how to use this toolkit. We assume that you have already downloaded and installed the NCBI BLAST+ toolkit and that you have the proteome file of each species (target and reference species) in FASTA format. You first have to create the databases for each reference species, here for human and yeast:

makeblastdb -in human.fa -input_type fasta -dbtype prot -out human_proteins -title human_proteins
makeblastdb -in yeast.fa -input_type fasta -dbtype prot -out yeast_proteins -title yeast_proteins

Subsequently, you can start the comprehensive BLAST searches using the FASTA file of your target species (here: P. anserina):

blastp -query panserina.fa -db human_proteins -out human_panserina.out -evalue 0.0001 -outfmt 6
blastp -query panserina.fa -db yeast_proteins -out yeast_panserina.out -evalue 0.0001 -outfmt 6

Please make sure that the output format is the tab-delimited list selected by the parameter -outfmt 6. The two species-specific homology files generated this way can be imported into your R session using the function read.table and subsequently used as data frames for the Path2PPI package.
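The import can be sketched as follows. To keep the example self-contained, it writes two of the result rows shown above to a temporary file instead of reading a real BLAST output; the object names are assumptions made for this illustration. With the default -outfmt 6 layout, the twelve columns are query id, subject id, percent identity, alignment length, mismatches, gap openings, query start, query end, subject start, subject end, E-value and bit score, which read.table names V1 to V12.

```r
# Stand-in for a BLAST result file in tab-delimited format (-outfmt 6).
blast.file <- tempfile(fileext = ".out")
writeLines(c(
  "B2AX00\tP53104\t25.66\t304\t165\t12\t15\t268\t21\t313\t1e-15\t79.0",
  "B2AXW7\tP53104\t28.52\t291\t154\t10\t510\t751\t30\t315\t7e-24\t106.0"
), blast.file)

# No header line; the default column names V1..V12 match the data frames
# shown in this section.
homologs <- read.table(blast.file, sep = "\t", header = FALSE,
                       stringsAsFactors = FALSE)

str(homologs)   # V1, V2: identifiers; V11: E-value; V12: bit score
```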

3 Predict PPI in target species

After the necessary data sets are generated or loaded, respectively, we can start with the prediction.

3.1 The Path2PPI object

An object of the class Path2PPI is the central instance responsible for storing and managing each data set and for carrying out each computation and prediction step. Hence, we first have to create a new instance of the class Path2PPI with the corresponding information:

ppi <- Path2PPI("Autophagy induction", "Podospora anserina", "5145")

The arguments are the title of the pathway we want to predict, the taxonomy name of the target species (“Podospora anserina”) and its corresponding taxonomy id (“5145”).

3.2 Add reference species

This new instance does not contain any reference species or a predicted PPI, yet:

ppi
## Autophagy induction in Podospora anserina (5145)
## ------------------------------------------------- 
## No reference species yet.
## ------------------------------------------------- 
## No predicted PPI yet.

To add the reference species, for which we have collected the necessary data, we make use of the method addReference.

ppi <- addReference(ppi, "Homo sapiens", "9606", human.ai.proteins, 
                    human.ai.irefindex, pa2human.ai.homologs)
## Search for all relevant interactions:
## 0%--25%--50%--75%--100%
## Remove irrelevant homologs.

Besides the taxonomy name and the taxonomy identifier, this method requires the list containing the proteins of the pathway of interest, the corresponding iRefIndex data frame or the file name of the corresponding iRefIndex file, and the species-specific homology data set generated by the NCBI BLAST+ toolkit. The method searches for all relevant interactions in the iRefIndex data frame. An iRefIndex file defines different and often ambiguous protein identifiers, and the “major” identifiers are not necessarily those in the corresponding “major” columns “uidA” and “uidB”. Furthermore, iRefIndex also contains complexes. Hence, this method applies an advanced search algorithm to automatically find the relevant interactions associated with the pathway or the proteins of interest, respectively. You do not have to predefine the identifier types (UniProt, SwissProt, Ensembl etc.), since these types are often assigned ambiguously. The algorithm searches for each identifier in the 10 columns where any type of identifier or accession number is defined: “uidA”, “altA”, “OriginalReferenceA”, “FinalReferenceA”, “aliasA”, “uidB”, “altB”, “OriginalReferenceB”, “FinalReferenceB” and “aliasB”. Additionally, it searches for each complex with which one or more of the predefined proteins are associated. Subsequently, each homologous relationship that is not relevant for the previously found interactions is discarded.
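The idea of searching all identifier columns can be illustrated with a small base R sketch. This is not the search algorithm implemented in addReference: it uses a hand-made toy table with invented cell contents, ignores complexes, and relies on plain substring matching, which a real implementation would have to make stricter.

```r
# The ten identifier columns named above.
id.columns <- c("uidA", "altA", "OriginalReferenceA", "FinalReferenceA",
                "aliasA", "uidB", "altB", "OriginalReferenceB",
                "FinalReferenceB", "aliasB")

# Toy stand-in for an iRefIndex table (3 interactions, all cells "-").
toy <- as.data.frame(matrix("-", nrow = 3, ncol = length(id.columns),
                            dimnames = list(NULL, id.columns)),
                     stringsAsFactors = FALSE)
toy$uidA <- c("uniprotkb:P42345", "uniprotkb:Q00000", "complex:abc")
toy$altB <- c("-", "uniprotkb:O75385|refseq:NP_000000", "-")

proteins <- c("P42345", "O75385")
pattern <- paste(proteins, collapse = "|")

# One logical column per identifier column: does the cell mention a protein?
hit.matrix <- sapply(id.columns, function(column) grepl(pattern, toy[[column]]))

# An interaction is relevant if any of the ten columns contains a hit.
relevant <- rowSums(hit.matrix) > 0
toy[relevant, c("uidA", "altB")]
```

The second toy row shows why looking at “uidA” and “uidB” alone is not enough: the protein of interest appears only in “altB”.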

In the same manner we add yeast to our Path2PPI-instance:

ppi <- addReference(ppi, "Saccharomyces cerevisiae (S288c)", "559292", 
                    yeast.ai.proteins, yeast.ai.irefindex, 
                    pa2yeast.ai.homologs) 
## Search for all relevant interactions:
## 0%--25%--50%--75%--100%
## Remove irrelevant homologs.

In this tutorial, we want to predict the PPIs in P. anserina based on these two reference species. You can use other and/or more reference species for your own purposes.

Now, we can get all processed information about the added reference species using the method showReferences:

showReferences(ppi)
## Homo sapiens (TaxId: 9606)
## --------------------------- 
## 5 proteins (0 not used)
## 894 interactions:
## - 6 interactions have both interactors in protein list.
## - 349 interactions have at least one interactor in protein list.
## - 660 interactions in 102 protein complexes.
## 
## 
## Saccharomyces cerevisiae (S288c) (TaxId: 559292)
## ------------------------------------------------- 
## 7 proteins (0 not used)
## 2910 interactions:
## - 15 interactions have both interactors in protein list.
## - 834 interactions have at least one interactor in protein list.
## - 2207 interactions in 102 protein complexes.

If we want to know which interactions have been found or which interactions are associated with the proteins of interest in a specific reference species (e.g. human), we can use the method showReferences as follows:

interactions <- showReferences(ppi, species="9606", 
                               returnValue="interactions")
head(interactions)
##       ref      A.db                 A.accession A.in.prot.list      B.db
## 1  287217   complex qx1eWqPyfshUfC/6x17AYjcT/3w          FALSE  replaced
## 7  287217   complex qx1eWqPyfshUfC/6x17AYjcT/3w          FALSE uniprotkb
## 13 287217   complex qx1eWqPyfshUfC/6x17AYjcT/3w          FALSE uniprotkb
## 19 436141 uniprotkb                  A0A090N900          FALSE  replaced
## 32 502959 uniprotkb                      P62942          FALSE  replaced
## 45 408315 uniprotkb                      Q8N122          FALSE  replaced
##    B.accession B.in.prot.list
## 1       P42345           TRUE
## 7   A0A0A0MR05          FALSE
## 13      Q6R327          FALSE
## 19      P42345           TRUE
## 32      P42345           TRUE
## 45      P42345           TRUE

For more information about the method showReferences, we refer to the corresponding manual page (?showReferences).
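The returned data frame can be filtered with ordinary base R indexing, for example to keep only the interactions where both interactors are in the protein list. The sketch below uses a small hand-made data frame with the column names shown in the output above; the second row is invented for illustration and does not come from the test data set.

```r
# Toy stand-in for the data frame returned by showReferences().
interactions <- data.frame(
  A.accession    = c("P62942", "O75385", "Q8N122"),
  A.in.prot.list = c(FALSE, TRUE, FALSE),
  B.accession    = c("P42345", "O75143", "P42345"),
  B.in.prot.list = c(TRUE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Keep the interactions with both interactors in the protein list.
both <- interactions[interactions$A.in.prot.list &
                     interactions$B.in.prot.list, ]
both

# Count the interactions with at least one interactor in the list.
sum(interactions$A.in.prot.list | interactions$B.in.prot.list)
```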

3.3 Predict PPI

After we added all reference species and all necessary data, we can start with the prediction. To predict the PPI network in the target species we use the method predictPPI:

ppi <- predictPPI(ppi,h.range=c(1e-60,1e-20))
## Begin with Homo sapiens
## 6 interactions processed. These lead to 5 interactions in target species.
## -------------------------------
## Begin with Saccharomyces cerevisiae (S288c)
## 15 interactions processed. These lead to 22 interactions in target species.
## -------------------------------
## Combine results to one single PPI.
## A total of 13 putative interactions were predicted in target species.

This method accepts several arguments that influence the prediction approach and define the output of the PPI network. For a detailed description of the various arguments, we refer to the corresponding manual page (?predictPPI). Here, we only use the argument h.range, where the first value is the lower bound and the second value the upper bound of the homology range. That means that each E-value equal to or less than the lower bound is scored with 1, and each E-value equal to or larger than the upper bound is scored with 0 (see appendix for a detailed description).

According to the reports generated by this method, the two species-specific PPI networks led to a PPI network in the target species with 13 interactions. To obtain further information about this prediction step, we just type:
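To make the role of the two bounds concrete, the following sketch maps E-values to scores between 0 and 1. Only the behaviour at and beyond the bounds is taken from the description above; the linear interpolation on the log10 scale between the bounds is an assumption made for this illustration, and the function score.evalue is hypothetical and not part of Path2PPI. The scoring actually used by the package is described in the appendix.

```r
# Hypothetical helper, not part of Path2PPI.
score.evalue <- function(evalue, h.range = c(1e-60, 1e-20)) {
  lower <- log10(h.range[1])   # E-values at or below this bound score 1
  upper <- log10(h.range[2])   # E-values at or above this bound score 0
  # Guard against E-values reported as exactly 0.
  log.e <- log10(pmax(evalue, .Machine$double.xmin))
  # ASSUMPTION: linear interpolation on the log10 scale between the bounds.
  score <- (upper - log.e) / (upper - lower)
  pmin(pmax(score, 0), 1)      # clamp to [0, 1]
}

score.evalue(c(1e-80, 1e-60, 1e-40, 1e-20, 1e-5))
```

With the range used above, the hits against yeast shown in section 2.2 that have E-values such as 1e-15 or 5e-09 lie above the upper bound and would therefore receive a homology score of 0.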

ppi #show(ppi)
## Autophagy induction in Podospora anserina (5145)
## ------------------------------------------------- 
## 2 reference species: 9606, 559292
## ------------------------------------------------- 
## Number of predicted proteins: 8
## Number of predicted interactions: 13
## Predicted PPI based on 2 reference species:
## 9606 (2 interactions and 4 homologous relations)
## 559292 (11 interactions and 11 homologous relations)
## ------------------------------------------------- 
## Settings:
## Homology threshold: 1e-05
## Homology range: [1e-60,1e-20]
## Interactions threshold: 0.7
## Consider complexes: FALSE

4 Results of the prediction

Having predicted the PPI network of the “autophagy induction” pathway in P. anserina, we now want to know what this network looks like and which proteins and interactions are actually associated with this pathway in our target species.

4.1 Plotting the results

To get a graphical representation of the predicted PPI network, Path2PPI provides three different plotting types. First, to get only the predicted PPI, we use the plot function of the Path2PPI-object, which is based on the igraph plotting function (Csardi and Nepusz, 2006):

set.seed(12) #Set random seed
coordinates <- plot(ppi, return.coordinates=TRUE)

There are various arguments provided with this method (see ?plot.Path2PPI). Here, we initially use the return.coordinates argument since we want to save the coordinates of the vertices for the next plotting approach.

In the second approach, we want to know from which reference species the different predicted interactions originated. We pass the previously computed coordinates to the plotting function, since we want to compare both networks:

plot(ppi,multiple.edges=TRUE,vertices.coordinates=coordinates)

The different colors of the edges correspond to the species from which each interaction was deduced (see the taxonomy identifiers in the legend: 5145 for P. anserina, 9606 for human, and 559292 for yeast). For example, we can see that the edge between the proteins “B2AWL” and “B2AE79” in the upper network is thicker than the others. This indicates that the interaction was found in more than one reference species. In the second plot, we see that this interaction is based on six interactions found in yeast and two interactions found in human.

Next, we want to plot the so-called hybrid PPI network, where we additionally can see the underlying reference interactions or the underlying reference PPI networks, respectively, and each homologous relationship. We also want to set the vertex labels, since we know the trivial names of the target species proteins. You can set the label for each protein of each species. Additionally, we want to change the species colors:

set.seed(40)
target.labels<-c("B2AE79"="PaTOR","B2AXK6"="PaATG1", 
                 "B2AUW3"="PaATG17","B2AM44"="PaATG11",
                 "B2AQV0"="PaATG13","B2B5M3"="PaVAC8")
species.colors <- c("5145"="red","9606"="blue","559292"="green")
plot(ppi,type="hybrid",species.colors=species.colors,
     protein.labels=target.labels)