To meet the need for a reference control in cell-free methylated DNA immunoprecipitation-sequencing (cfMeDIP-seq)^{1,2} experiments, we designed spike-in controls and ligated unique molecular indexes (UMI) to adjust for PCR bias, and immunoprecipitation bias caused by the fragment length, G+C content, and CpG density of the DNA fragments that are immunoprecipitated. This enables absolute quantification of methylated DNA in picomoles, while retaining epigenomic information that allows for sensitive, tissue-specific detection as well as comparable results between different experiments. We designed DNA fragments with 2x3x3x3=54 combinations of methylation status (methylated and unmethylated), fragment length in basepair (bp) (80 bp, 160 bp, 320 bp), G+C content (35%, 50%, 65%), and fraction of CpGs within a fragment (1 CpG/ 80 bp, 1 CpG/ 40 bp, 1 CpG/ 20 bp). Spiky was developed for analyzing DNA methylation of cell-free DNA obtained from cfMeDIP-seq method using reference ‘spike-in’ controls. This package will:

- Assess methylation specificity in each sample
- Using the spike-in control data, output a Gaussian generalized linear model to predict molar amount on DNA samples
- Predict molar amount (picomoles) for each DNA sequence of interest, adjusting for fragment length G+C content and CpG fraction
- Adjust molar amount and bin the fragments into genomic windows used in analyses

Install and load the spiky package from Bioconductor.

- BAM or BED file
- Columns:
- chrom/contig: string
- position start: numeric
- position end: numeric
- read counts: integer
- fragment length (bp): integer
- G+C content [0-1]: numeric
- CpG number: Integer

- genomic - A GRanges object showing the genomic coverage of the BAM reads
- spikes - A GRanges object showing the coverage of the spikes.

For each combination of parameters, we designed two distinct spike-in sequences. One to be methylated and one to be unmethylated. The allows us to assess non-specific binding of the monoclonal antibody on a sample-by-sample basis. To calculate methylation specificity we take the number of methylated reads divided by the total number of reads. It is our recommendation that if methylation specificity is <0.98, then the sample should be flagged or removed from analysis as the cfMeDIP performed inadequately.

This calculation is done by the ‘methylation_specificity’ function.

- ssb_res as produced in the previous step

- Mean and median of the percent of methylated sequences for each spike-in after cfMeDIP-seq has been performed

For each batch of samples, the coefficients used in the Gaussian generalized linear model will differ. The ‘model_glm_pmol’ will calculate these coefficients and output the model to be used to calculate molar amount (picomoles) on the user’s DNA sequences of interest. We assume that all DNA sequences of interest are methylated after undergoing cfMeDIP-seq. As such, we build the Gaussian generalized linear model on only the methylated spike-in control fragments. A generated Bland-Altman plot will visualize how well the model performs.

- Example: data(ssb_res)

- Gaussian generalized linear model with coefficients specific to samples used in input data; .rda file

```
## Build the Gaussian generalized linear model on the spike-in control data
gaussian_glm <- model_glm_pmol(covg_to_df(ssb_res,spike=spike),spike=spike)
summary(gaussian_glm)
#>
#> Call:
#> glm(formula = conc ~ read_count + fraglen + GC + CpG_3, data = x)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -7.901e-04 -1.812e-04 5.277e-05 2.006e-04 4.682e-04
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 3.729e-03 3.779e-04 9.866 2.45e-09 ***
#> read_count -4.150e-07 9.332e-08 -4.447 0.000223 ***
#> fraglen -1.215e-05 9.877e-07 -12.303 4.59e-11 ***
#> GC 7.323e-06 5.698e-06 1.285 0.212698
#> CpG_3 3.523e-04 2.261e-04 1.558 0.134238
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for gaussian family taken to be 1.056382e-07)
#>
#> Null deviance: 3.3927e-05 on 25 degrees of freedom
#> Residual deviance: 2.2184e-06 on 21 degrees of freedom
#> AIC: -337.41
#>
#> Number of Fisher Scoring iterations: 2
```

For the samples in which the Gaussian generalized linear model was built, we will calculate the molar amount (picomoles) for each DNA sequence of interest.

- Example: data(ssb_res)

- Data frame
- Columns:
- chrom: string
- bin position start: numeric
- bin position end: numeric
- read counts: coverage of bin
- fragment length (bp): integer
- G+C content [0-1]: numeric
- CpG number: numeric
- pmol: numeric

```
# Predict pmol concentration
# To select a genome other than hg38, use BSgenome::available.packages() to find valid BSgenome name
#library("BSgenome.Hsapiens.UCSC.hg38")
sample_data_pmol <- predict_pmol(gaussian_glm, ssb_res,bsgenome="BSgenome.Hsapiens.UCSC.hg38",ret="df")
#> Adjusting for bin-level biases...
#> Attempting to load BSgenome.Hsapiens.UCSC.hg38...
#> OK.
#> Done.
#> Starting pmol prediction...
#> Done.
head(sample_data_pmol,n=1)
#> chrom range.start range.end
#> chr22:10536001-10536300 chr22 10536001 10536300
#> sequence
#> chr22:10536001-10536300 ATACAACAGAGAAATGCATAATGTCCAATCAATTTATTAAATTTCCAAAGTCGGTCACGCGCAGTGGCTCACACCTGTAATCTGAACACTTCAGGAGGCCGAGACGTGTGGATCACGAGGTCAGGAGTTGGAGACTAGCCTGACCAACATGGTGAAACCCCGTCTATACTAAAAATACAAAAATTAGCCAGGCATGGTGGCACGTGGCTGTAATCCCCGCTACTCAGGAGGCTGAGGCAGGAGAATTGCTTGAACCTGGGAGGCAGAGGTTGCAGTGAGCCGAGATGGCGCCACCGCACT
#> read_count fraglen GC CpG_3 pred_conc
#> chr22:10536001-10536300 0.16 300 0.5033333 2.289428 0.0008930107
```

For our analyses, we binned the genome into 300 bp non-overlapping windows. We then look overlap between fragments in our data with each of the 300 bp genomic windows. We adjust the molar amount (picomoles) by a multiplier. This multiplier is the proportion of overlap between our fragment and the 300 bp window. This is done for every fragment in our sample.

- Example: sample_pmol_data as produced in previous step

- Data frame
- Columns:
- chrom: string
- bin position start: numeric
- bin position end: numeric
- read counts: coverage of bin
- fragment length (bp): integer
- G+C content [0-1]: numeric
- CpG number: numeric
- pmol: numeric
- adjusted pmol: numeric

```
sample_binned_data <- bin_pmol(sample_data_pmol)
head(sample_binned_data,n=1)
#> chrom range.start range.end
#> chr22:10536001-10536300 chr22 10536001 10536300
#> sequence
#> chr22:10536001-10536300 ATACAACAGAGAAATGCATAATGTCCAATCAATTTATTAAATTTCCAAAGTCGGTCACGCGCAGTGGCTCACACCTGTAATCTGAACACTTCAGGAGGCCGAGACGTGTGGATCACGAGGTCAGGAGTTGGAGACTAGCCTGACCAACATGGTGAAACCCCGTCTATACTAAAAATACAAAAATTAGCCAGGCATGGTGGCACGTGGCTGTAATCCCCGCTACTCAGGAGGCTGAGGCAGGAGAATTGCTTGAACCTGGGAGGCAGAGGTTGCAGTGAGCCGAGATGGCGCCACCGCACT
#> read_count fraglen GC CpG_3 pred_conc
#> chr22:10536001-10536300 0.16 300 0.5033333 2.289428 0.0008930107
#> adjusted_pred_con
#> chr22:10536001-10536300 0.0001428817
```

1. Shen, S. Y. *et al.* Sensitive tumour detection and classification using plasma cell-free dna methylomes. *Nature* **563**, 579–583 (2018).

2. Shen, S. Y., Burgener, J. M., Bratman, S. V. & De Carvalho, D. D. Preparation of cfMeDIP-seq libraries for methylome profiling of plasma cell-free dna. *Nature protocols* **14**, 2749–2780 (2019).