The `DNABarcodes`

package offers a function to create DNA barcode sets capable
of correcting substitution errors or insertion, deletion, and substitution
errors. Existing barcodes can be analysed regarding their minimal, maximal and
average distances between barcodes. Finally, reads that start with a (possibly
mutated) barcode can be demultiplexed, i.e., assigned to their original
reference barcode.

The most common use cases are:

`create.dnabarcodes`

is used to create sets of DNA barcodes with error correction properties.`demultiplex`

is used to demultiplex a set of reads based on the used set of DNA barcodes`analyse.barcodes`

is used to analyse the properties of an existing set of DNA barcodes and to assess its error correction capabilities

Suppose we want to generate a set of 5bp long barcodes with default settings. The default enforces a minimum Hamming distance of 3 between each DNA barcode, which is sufficient to correct at least one substitution error:

```
library("DNABarcodes")
## Loading required package: Matrix
## Loading required package: parallel
mySet <- create.dnabarcodes(5)
## 1) Creating pool ... of size 592
## 2) Conway closing... done
show(mySet)
## [1] "GAGAA" "AGCAA" "CCTAA" "CAAGA" "ACGGA" "GTCGA" "TGTGA" "GGACA" "CTGCA"
## [10] "TACCA" "CGAAG" "TCGAG" "GTTAG" "ATAGG" "AAGCG" "GAATG" "TGCTG" "ACTTG"
## [19] "ACAAC" "CACAC" "TAGGC" "CTTGC" "TTACC" "GATCC" "AGGTC" "GCCAT" "TCAGT"
## [28] "AACGT" "TGGCT" "CAGTT"
```

In default mode, the Conway lexicographic algorithm is used to generate the
set. Conway’s algorithm is simple and most of the time efficient enough. When
sufficient computing power is available, Ashlock’s evolutionary algorithm
should be used. The parameter to change the set generation algorithm is named
*heuristic*.

```
mySetAshlock <- create.dnabarcodes(5, heuristic="ashlock")
## 1) Creating pool ... of size 592
## 2) Initiating Chromosomes done
## 3) Running Greedy Evolutionary done
show(mySetAshlock)
## [1] "ACGTT" "TGCAT" "TCTGA" "GAGAA" "ACCAA" "CGTAA" "CAAGA" "ATGGA" "AGACA"
## [10] "TACCA" "GTTCA" "GCATA" "TGGTA" "CTCTA" "GGAAG" "TCGAG" "CACAG" "ACAGG"
## [19] "TTCGG" "GATGG" "CTACG" "AAGCG" "TGTCG" "GTGTG" "AGCTG" "CCTTG" "CCAAC"
## [28] "AGGAC" "GTCAC" "TGAGC" "AACGC" "CTTGC" "GAACC" "TTGCC" "ACTCC" "CAGTC"
## [37] "TCCTC" "GGTTC" "CTGAT" "GCTAT" "GTAGT" "TAGGT" "AGTGT" "TCACT" "ATCCT"
## [46] "CATCT" "CGATT" "GACTT"
```

Importantly, the set of DNA barcodes is most often larger when Ashlock’s algorithm is used.

Finally, we want to create a set of 5bp long DNA barcodes that support the
correction of 2 substitutions. For that, we enforce a minimum distance of 5
using the parameter *dist*.

```
mySetDist5 <- create.dnabarcodes(5, dist=5, heuristic="ashlock")
## 1) Creating pool ... of size 592
## 2) Initiating Chromosomes done
## 3) Running Greedy Evolutionary done
show(mySetDist5)
## [1] "ATGAG" "GGATA" "CACCT" "TCTGC"
```

The number of errors \(k_c\) that a set of DNA barcodes can correct is expressed by the following formula. The variable \(dist\) gives the minimum distance of the set.

\[k_c \leq \left\lfloor \frac{dist - 1}{2} \right\rfloor\]

The number of \(k_d\) of detectable errors is given as:

\[k_d \leq dist - 1\]

The following table shows common distances \(dist\) and their error correction (\(k_c\)) and detection (\(k_d\)) properties:

\(dist\) | \(k_c\) | \(k_d\) |
---|---|---|

3 | 1 | 2 |

4 | 1 | 3 |

5 | 2 | 4 |

6 | 2 | 5 |

7 | 3 | 6 |

8 | 3 | 7 |

9 | 4 | 8 |

To obtain a large enough set for the targeted number of samples, it is often necessary to increase barcode length. Here, we want to correct 2 substitution errors and want to target at least 20 samples:

```
show(length(create.dnabarcodes(5, dist=5, heuristic="ashlock")))
## 1) Creating pool ... of size 592
## 2) Initiating Chromosomes done
## 3) Running Greedy Evolutionary done
## [1] 4
show(length(create.dnabarcodes(6, dist=5, heuristic="ashlock")))
## 1) Creating pool ... of size 1160
## 2) Initiating Chromosomes done
## 3) Running Greedy Evolutionary done
## [1] 8
show(length(create.dnabarcodes(7, dist=5, heuristic="ashlock")))
## 1) Creating pool ... of size 7568
## 2) Initiating Chromosomes done
## 3) Running Greedy Evolutionary done
## [1] 21
```

Therefore, 7bp long DNA barcodes should be used.

To generate sets of DNA barcodes that support the correction of insertions,
deletions, and substitutions (e.g., for the PacBio platform), a different
distance metric must be used. In a DNA context (i.e., the DNA barcode is
surrounded by other DNA nucleotides), the Sequence-Levenshtein distance is the
right choice. Here, we generate a set of 5bp long DNA barcodes capable of
correcting one indel or substitution error. The parameter to change the
distance metric is named *metric*:

```
mySeqlevSet <- create.dnabarcodes(5, metric="seqlev", heuristic="ashlock")
## 1) Creating pool ... of size 592
## 2) Initiating Chromosomes done
## 3) Running Greedy Evolutionary done
show(mySeqlevSet)
## [1] "AGCAC" "CCTAC" "GGTAA" "TCGAA" "AAGAG" "TTAGG" "GATGG" "CTGTG" "AATCC"
## [10] "GTCTC" "ACGTT" "TGCTT"
```

The requirements of the Sequence-Levenshtein distance are more stringent than those of the Hamming distance. The number of DNA barcodes in the set is therefore smaller, but the set is more robust.

By default, `create.dnabarcodes`

filters all those DNA sequences that contain
triplets, show a GC bias, or are self-complementary. Some of these filters may
be unnecessary on current or future platforms. For example, with the Illumina’s
Sequencing by Synthesis technology, the triplet filter is unnecessary. Here, we
generate a default set of DNA barcodes of length 5bp without filtering triplets:

```
mySetTriplets <- create.dnabarcodes(5, heuristic="ashlock", filter.triplets=FALSE)
## 1) Creating pool ... of size 640
## 2) Initiating Chromosomes done
## 3) Running Greedy Evolutionary done
show(mySetTriplets)
## [1] "TGTGA" "GCCAA" "CTCGA" "CGAAA" "GAAGA" "ACGGA" "TCACA" "CAGCA" "AGCCA"
## [10] "GTTCA" "GGGTA" "CCTTA" "ACAAG" "GAGAG" "TGCAG" "CTTAG" "TTAGG" "AACGG"
## [19] "ATGCG" "TATCG" "CAATG" "TCGTG" "GTCTG" "AGTTG" "GTAAC" "AGGAC" "CACAC"
## [28] "TCTAC" "TAGGC" "ATTGC" "AAACC" "TTCCC" "TGATC" "CTGTC" "ACCTC" "GATTC"
## [37] "CCGAT" "GGTAT" "AGAGT" "GTGGT" "TCCGT" "CATGT" "CTACT" "TGGCT" "GACCT"
## [46] "ACTCT" "GCATT" "CGCTT"
```

Note that the size of the pool of candidate barcodes is larger (640) than for the default setting that includes the triplet filter (592). Potentially, this allows the creation of larger sets, which we observed for longer DNA barcodes.

In the following, we describe methods to generate subsets of existing sets of DNA barcodes. Often, a researcher already has a pre-existing set of DNA barcodes, for example in chemical form as indexes from his supplier. When a smaller number of samples needs to multiplexed, he can choose a more robust subset of the existing indexes to get better error correction capabilities.

The set of candidate barcodes is given to `create.dnabarcodes`

through the
parameter *pool*.

For example, the researcher may have a RNASeq library preparation kit that includes 48 indexes. The set can already be used to correct 1 substitution. He creates a robust subset for the correction of 2 substitutions the following way:

```
data(supplierSet)
myRobustSet <- create.dnabarcodes(7, dist=5, pool=supplierSet, heuristic="ashlock")
## 1) Creating pool ... of size 48
## 2) Initiating Chromosomes done
## 3) Running Greedy Evolutionary done
show(myRobustSet)
## [1] "CCGGTTA" "AGACCTG" "GATACGC" "GGAAGAA" "AAGTGCA" "TCTTGAG" "ATCGTCG"
## [8] "CGATAGC" "CAAGCAT" "TGCAACT" "CTTCGTT"
```

Alternatively, we may want to create a subset that is capable of correcting indels in addition to substitutions:

```
myRobustSetSeqlev <- create.dnabarcodes(7, metric="seqlev", pool=supplierSet, heuristic="ashlock")
## 1) Creating pool ... of size 48
## 2) Initiating Chromosomes done
## 3) Running Greedy Evolutionary done
show(myRobustSetSeqlev)
## [1] "CTTCGTT" "CTCATGC" "GGAAGAA" "CCAGGAA" "CTGCAGA" "GCTTAGA" "CCTCTGA"
## [8] "GATCACA" "ATAGGCA" "CCAATCA" "GTGGTCA" "TGCACTA" "CCGGTTA" "TGTCCAG"
## [15] "AAGATGG" "ATCGTCG" "CAGTGTG" "AGACCTG" "TATGGAC" "CTAGTAC" "CGATAGC"
## [22] "CAACTGC" "TCGATCT" "GGTCTCT"
```

Demultiplexing is processing step where reads are assigned to their samples.
Demultiplexing is achieved easily with the `demultiplex`

function. In the
following example we assume to have a file that contains all reads that start
with a DNA barcode. The used set contains 48 7nt long DNA barcodes and was
generated to correct up to one substitution.

```
data(mutatedReads)
demultiplex(head(mutatedReads), supplierSet)
## read barcode distance
## CCTCAAC CGTCAACCCGTCAACACGTCAACACGTCAACGCGTCAACT CCTCAAC 1
## GGAAGAA CGAAGAACCGAAGAAACGAAGAACCGAAGAAGCGAAGAAA GGAAGAA 1
## GTTAGCA GTTAGCAGGTTAGCATGTTAGCAGGTTAGCATGTTAGCAG GTTAGCA 0
## CCAGGAA TCAGGAACTCAGGAACTCAGGAAGTCAGGAATTCAGGAAT CCAGGAA 1
## CTAGTAC CAAGTACTCAAGTACCCAAGTACCCAAGTACTCAAGTACG CTAGTAC 1
## GGTCTCT GGTCACTGGGTCACTAGGTCACTAGGTCACTAGGTCACTT GGTCTCT 1
```

It is important to supply the correct distance metric to `demultiplex`

using
the parameter *metric*. Otherwise the result will change unintentionally and in
unexpected ways.

Suppose we obtained a set of DNA barcodes in form of sample indexes from our
library preparation supplier. We want to analyse the set to understand the
errors that can be corrected or detected. The function `analyse.barcodes`

easily does just that:

```
analyse.barcodes(supplierSet)
## Description hamming seqlev levenshtein
## 1 Mean Distance 5.242908 3.656915 4.560284
## 2 Median Distance 5.000000 4.000000 5.000000
## 3 Minimum Distance 3.000000 1.000000 2.000000
## 4 Maximum Distance 7.000000 7.000000 7.000000
## 5 Guaranteed Error Correction 1.000000 0.000000 0.000000
## 6 Guaranteed Error Detection 2.000000 0.000000 1.000000
```

The output table lists mean, median, minimum, and maximum distances of each pair of DNA barcodes in the set. Finally, it lists the guaranteed error correction and detection capabilities that can be reached for the set. The analysis is presented for a choice of distance metrics: Hamming, Sequence-Levenshtein, and Levenshtein.

The `DNABarcodes`

package currently supports four distance metrics. Their
capabilities and properties are as follows:

A high enough Hamming Distance (`metric = "hamming"`

) allows the
correction/detection of substitutions. Due to the ignorance of insertions and
deletions, any changes to the length of the barcode as well as DNA context are
ignored, which makes the Hamming distance a simple choice.

A high enough Sequence Levenshtein distance (`metric = "seqlev"`

) allows the
correction/detection of insertions, deletions, and substitutions in scenarios
where the barcode was attached to a DNA sequence. Your sequence read, coming
from a Illumina, Roche, PacBio NGS machine of your choice, should then start
with that barcode, followed by the insert or an adapter or some random base
calls.

A high enough Levenshtein distance (`metric = "levenshtein"`

) allows the
correction/detection of insertions, deletions, and substitutions in scenarios
where the barcode was not attached anywhere, respective where the exact
beginning and end of the barcode is known. This is as far as we know in no
current NGS technology the case. Do not use this distance metric, except you
know what you are doing.

A high enough Phaseshift distance (`metric = "phaseshift"`

) allows the
correction/detection of substitutions as well as insertions or deletions that
occurred to the front of the DNA barcode. The intended target technology of
this distance is the Illumina Sequencing by Synthesis platform. We consider
this metric to be highly experimental.

We support four different algorithms for the generation of sets of DNA barcodes. Each has its particular advantages and disadvantages.

The heuristics `conway`

and `clique`

produce fast results but are not nearly as
good as the heuristics `sampling`

and `ashlock`

. The `clique`

heuristic is
marginally slower and needs more memory than the `conway`

heuristic because it
first constructs a graph representation of the pool.

The heuristic `ashlock`

is assumed to produce the best heuristic results with a
reasonable parameter configuration. New users should first try `conway`

and
then `ashlock`

with default parameters.

`Ashlock`

HeuristicThe Ashlock heuristic is an evolutionary algorithm. Briefly, it iterates over a
*population* of so-called chromosomes (not to be confused with actual genomic
chromosomes), while trying to improve set creation results in each step. In
each iteration, it retains the best chromosomes and replaces the rest with
mutated copies of the best ones.

The algorithm has two parameters:

- The number of iterations (
`iterations`

) - The size of the population (
`population`

)

More iterations and a larger population will (possibly) increase the size of the resulting set of DNA Barcodes but will also increase running time considerably.

We believe that 100 is a very good default for the number of iterations. The best chance, in our opinion, is to increase the number of chromosomes in the population considerably, for example up to 500 or even 1000.

The following examples show the increase in set size:

```
length(create.dnabarcodes(10, metric="seqlev", heuristic="ashlock",cores=32))
## 1) Creating pool ... of size 488944
## 2) Initiating Chromosomes done
## 3) Running Greedy Evolutionary done
## 2126
length(create.dnabarcodes(10, metric="seqlev", heuristic="ashlock",cores=32, population=500, iterations=250))
## 1) Creating pool ... of size 488944
## 2) Initiating Chromosomes done
## 3) Running Greedy Evolutionary done
## 2133
```