\name{PairSummaries}
\alias{PairSummaries}
\title{
Summarize connected pairs in a LinkedPairs object
}
\description{
Takes in a ``LinkedPairs'' object, which carries its associated gene calls, and returns a data.frame of paired features.
}
\usage{
PairSummaries(SyntenyLinks,
              DBPATH,
              PIDs = FALSE,
              Score = FALSE,
              IgnoreDefaultStringSet = FALSE,
              Verbose = FALSE,
              Model = "Generic",
              DefaultTranslationTable = "11",
              AcceptContigNames = TRUE,
              OffSetsAllowed = NULL,
              Storage = 1,
              ...)
}
\arguments{
  \item{SyntenyLinks}{
A \code{LinkedPairs} object. In previous versions of this function, a \code{GeneCalls} object was also required, but this object is now carried forward from \code{NucleotideOverlap} inside the \code{LinkedPairs} object.
}
  \item{DBPATH}{
A SQLite connection object or a character string specifying the path to the database file constructed from DECIPHER's \code{Seqs2DB} function. This argument is always required because \code{PairSummaries} always computes the tetramer distance between paired sequences.
}
  \item{PIDs}{
Logical indicating whether to provide a PID for each pair. If \code{TRUE}, all pairs will be aligned using DECIPHER's \code{AlignProfiles}. This step can be time consuming, especially for large numbers of pairs. Default is \code{FALSE}.
}
  \item{Score}{
Logical indicating whether to provide a length-normalized score with DECIPHER's \code{ScoreAlignment} function. If \code{TRUE}, all pairs will be aligned using DECIPHER's \code{AlignProfiles}. This step can be time consuming, especially for large numbers of pairs. Default is \code{FALSE}.
}
  \item{IgnoreDefaultStringSet}{
Logical indicating alignment type preferences. If \code{FALSE} (the default), pairs that can be aligned in amino acid space will be aligned as an \code{AAStringSet}. If \code{TRUE}, all pairs will be aligned in nucleotide space. For \code{PairSummaries} to align the translation of a pair of sequences, both sequences must be tagged as coding in the ``GeneCalls'' object and must be the correct width for translation.
}
  \item{Verbose}{
Logical indicating whether or not to display a progress bar and print the time difference upon completion.
}
  \item{Model}{
A character string specifying a model to use to predict PIDs without performing an alignment. By default this argument is ``Generic'', specifying a generic PID prediction model based on PIDs computed from a randomly selected set of genomes. Currently no other models are included. Users may also supply their own model of type ``glm'' in the form of an RData file. This model will need to take in some, or all, of the columns of per-pair statistics that \code{PairSummaries} supplies.
}
  \item{DefaultTranslationTable}{
A character string used to set the default translation table for \code{translate}. It is passed to \code{getGeneticCode} and is used when no translation table is specified in the ``GeneCalls'' object.
}
  \item{AcceptContigNames}{
Logical indicating whether to match the names of contigs between the gene calls object and the synteny object. Where relevant, the first white space and everything following it are removed from contig names. If \code{TRUE}, \code{PairSummaries} assumes that the contigs at each position in the synteny object and the ``GeneCalls'' object are in the same order. It is automatically set to \code{TRUE} when the ``GeneCalls'' are of class ``GRanges''. It is currently \code{TRUE} by default.
}
  \item{OffSetsAllowed}{
Defaults to \code{NULL}. Supplying an integer vector indicates the gap sizes to attempt to fill. A value of \code{2} will attempt to span gaps of size 1. If a vector of length greater than 1 is provided, e.g. \code{c(2, 3)}, \code{PairSummaries} will attempt to query all gap sizes implied by the vector, in this case gaps of size 1 and 2.
}
  \item{Storage}{
Numeric indicating the approximate amount of memory, in gigabytes, that a user wishes to allow for holding \code{StringSet}s while extracting gene sequences. The lower \code{Storage} is set, the more likely it is that \code{PairSummaries} will need to re-access \code{StringSet}s when extracting gene sequences. The higher \code{Storage} is set, the more sequences \code{PairSummaries} will attempt to hold in memory, avoiding the need to re-access the source database many times. Set to 1 by default, indicating that \code{PairSummaries} can store one gigabyte of sequences in memory at a time.
}
  \item{...}{
Arguments to be passed to \code{AlignProfiles} and \code{DistanceMatrix}.
}
}
\details{
The \code{LinkedPairs} object generated by \code{NucleotideOverlap} is a container for raw data that describes possible orthologous relationships; however, the ultimate assignment of orthology is left to user discretion. \code{PairSummaries} generates a clear table with relevant statistics for a user to work with as they choose. The option to align all pairs, though onerous, allows users to apply a hard threshold to predictions by PID, while built-in models allow more expedient thresholding from predicted PIDs.
}
\value{
An object of class ``data.frame'' and ``PairSummaries'' containing paired genes that are connected by syntenic hits, with columns describing the k-mers that link each pair. Columns ``p1'' and ``p2'' give the location ids of the genes in the pair in the form ``DatabaseIdentifier_ContigIdentifier_GeneIdentifier''. ``ExactMatch'' provides an integer representing the exact number of nucleotides contained in the linking k-mers. ``TotalKmers'' provides an integer describing the number of distinct k-mers linking the pair. ``MaxKmer'' provides an integer describing the largest k-mer that links the pair. ``Consensus'' provides a value between 0 and 1 indicating whether the k-mers that link a pair of features are in the same position in each feature, with 1 indicating they are in exactly the same position and 0 indicating they are in as different a position as is possible. ``Adjacent'' provides an integer value ranging between 0 and 2 denoting whether a feature pair's direct neighbors are also paired. Gap-filled pairs neither have neighbors nor are included as neighbors. ``TetDist'' provides the Euclidean distance between the oligonucleotide (size 4) frequencies of the predicted pair. ``PIDType'' provides a character vector with values of ``NT'' where either member of the pair is not a translatable sequence, or ``AA'' where both sequences are translatable. If users choose to perform pairwise alignments, there will be a ``PID'' column providing a numeric describing the percent identity between the two sequences. If users choose to predict PIDs using their own or a provided model, a ``PredictedPID'' column will be provided.
}
\author{
Nicholas Cooley \email{npc19@pitt.edu}
}

\seealso{
\code{\link{FindSynteny}}, \code{\link{Synteny-class}}, \code{\link{NucleotideOverlap}}
}
\examples{
# this function will be deprecated soon,
# please see the new SummarizePairs() function.
DBPATH <- system.file("extdata",
                      "Endosymbionts_v05a.sqlite",
                      package = "SynExtend")
                      
data("Endosymbionts_LinkedFeatures", package = "SynExtend")

Pairs <- PairSummaries(SyntenyLinks = Endosymbionts_LinkedFeatures,
                       PIDs = FALSE,
                       DBPATH = DBPATH,
                       Verbose = TRUE)
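
# A minimal sketch of inspecting the result, assuming the columns
# described in the Value section (e.g. "TotalKmers") are present.
# The cutoff of 2 and the name 'WellSupported' are purely illustrative
# and are not recommendations from the package.
head(Pairs)
WellSupported <- Pairs[Pairs$TotalKmers >= 2, ]
nrow(WellSupported)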
}
