# 1 Introduction

The majority of statistical strategies used in differential expression analysis of high-dimensional biological data (e.g., gene expression, CyTOF) are based on linear models or analysis of variance. Linear models are favored in many biological problems mainly because of their ease of use and interpretability. However, linear models may be insufficient to capture the relationships among co-expressed features (e.g., relationships among signaling markers within and across cell subpopulations in CyTOF data), so such problems often require nonlinear models. Kernel-based approaches extend the class of linear models to nonlinear models in a computationally tractable manner. Additionally, instead of assuming a linear functional relationship, kernel-based approaches assume a more general relationship within a Hilbert space of functions spanned by a kernel function.

In this vignette, we introduce a differential expression analysis of high-dimensional biological data using a kernel-based score test, referred to as kernel-based differential analysis (cytoKernel), which can identify differentially expressed features. The cytoKernel R package contains the CytoK() function, which computes an adjusted p-value for each feature based on two groups. Further, the package calculates the shrunken effect size and its corresponding standard deviation (sd) using the Adaptive Shrinkage (ash) procedure (Stephens (2017)). We demonstrate with an example data set that the Differential expression analysis using Kernel-based Score test (cytoKernel) procedure can be adapted to flow and mass cytometry experiments, as well as other biological experiments (e.g., gene expression, RNA-Seq).

Consider a matrix of order $$p \times n$$ with elements $$x_{ij}$$, $$i=1,\dots,p$$ and $$j=1,\dots,n$$, where $$n$$ is the number of samples (Group1+Group2 combined) and $$p$$ is the number of features.

We observe the data $$\{x_{ij},y_j\}$$, where $$x_{ij}$$ is the median intensity for sample $$j$$ and feature $$i$$, and $$y_j$$ (binary response) is the group (or condition) label for sample $$j$$ ($$y_j=0$$ for Group1 and $$y_j=1$$ for Group2).
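As a minimal sketch of this data layout (with simulated values; the feature and sample counts here are arbitrary), the matrix of expression values and the vector of group labels can be set up as:

```r
# Hypothetical toy data: p = 5 features, n = 8 samples (4 per group).
# x[i, j] is the median intensity of feature i in sample j;
# y[j] is the binary group label for sample j.
set.seed(1)
p <- 5
n <- 8
x <- matrix(rnorm(p * n), nrow = p, ncol = n)
y <- rep(c(0, 1), each = n / 2)  # Group1 = 0, Group2 = 1
```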

For feature $$i$$, $$i=1,\dots,p$$, we define a simple logistic (semi-parametric) model,

$logit[P(y_j=1)] = \beta_0+ f(x_{ij}),$

where $$f(.)$$ is a centered smooth function in a Reproducing Kernel Hilbert Space (RKHS) spanned by the kernel function $$k$$.

Under $$H_0:~f(.)=0$$, the expression values $$x_{ij}$$ of feature $$i$$ are not related to the group labels $$y_j$$, i.e., feature $$i$$ is not differentially expressed.

Let $$K$$ be an $$n \times n$$ Gram matrix with entries $$K_{st}=k_{\rho}(x_{is},~x_{it})$$. Here, $$k_{\rho}(.,.)$$ is the reproducing kernel of the RKHS which contains $$f(.)$$, and $$\rho$$ is an unknown kernel parameter.

Let $$\mathbf{y}$$ be an $$n \times 1$$ vector of $$0$$s and $$1$$s. The score test statistic under the null hypothesis ($$H_0:~f(.)=0$$) in the logistic model defined above is $S(\rho) = \frac{Q(\rho)- \mu_{Q}}{\sigma_{Q}},$

where $$Q(\rho)=(\mathbf{y}-\mathbf{\hat{\mu_0}})^{\mathbf{T}}\mathbf{K}(\mathbf{y}-\mathbf{\hat{\mu_0}})$$, $$\mathbf{\hat{\mu_0}}=\mathrm{logit}^{-1}(\hat{\beta_0})$$, and $$\hat{\beta_0}$$ is the estimate of $$\beta_0$$ under the null model.
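For an intercept-only logistic model, $$\hat{\beta_0}$$ (and hence $$\mathbf{\hat{\mu_0}}$$) can be obtained with glm(); a small sketch with simulated labels follows. Note that under the null model the fitted probability reduces to the sample proportion of Group2.

```r
set.seed(1)
n <- 8
y <- rep(c(0, 1), each = n / 2)         # simulated group labels
fit0 <- glm(y ~ 1, family = binomial)   # null model: f(.) = 0
beta0_hat <- coef(fit0)[["(Intercept)"]]
mu0_hat <- plogis(beta0_hat)            # logit^{-1}(beta0_hat)
# For the intercept-only fit, mu0_hat equals mean(y)
```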

More details about the estimation of $$\mu_{Q}$$ and $$\sigma_{Q}$$ and the choice of $$\rho$$ can be found in Zhan, Patterson, and Ghosh (2015), Liu, Ghosh, and Lin (2008), and Davies (1987).

$$Q(\rho)$$ can be rewritten as $Q(\rho) = \sum_s {\sum_t {{k_{\rho}(x_{is},x_{it})}(y_{s}-\hat{\mu_0})(y_{t}-\hat{\mu_0})}},$ which is the sum of the entries of the component-wise (Hadamard) product of the matrices $$\mathbf{K}$$ and $$(\mathbf{y}-\mathbf{\hat{\mu_0}})(\mathbf{y}-\mathbf{\hat{\mu_0}})^{\mathbf{T}}$$.
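The equivalence of the quadratic form and the component-wise sum can be checked numerically. Below is a sketch using simulated values for one feature and a Gaussian kernel with $$\rho = 1$$; $$\hat{\mu_0}$$ is taken as the sample proportion, which is the intercept-only logistic estimate under the null.

```r
set.seed(1)
n   <- 8
xi  <- rnorm(n)                     # feature i across the n samples
y   <- rep(c(0, 1), each = n / 2)   # group labels
rho <- 1
K   <- outer(xi, xi, function(u, v) exp(-(u - v)^2 / rho))
r   <- y - mean(y)                  # y - mu0_hat under the null model
Q_quad <- drop(t(r) %*% K %*% r)    # quadratic-form definition of Q(rho)
Q_sum  <- sum(K * tcrossprod(r))    # sum over the Hadamard product
```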

We use a Gaussian distance-based kernel: $k_{\rho}(x_{is},x_{it})= \exp\left\{-\frac{(x_{is}-x_{it})^2}{\rho}\right\}.$
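A sketch of computing the Gram matrix for one feature with this kernel (simulated values; $$\rho$$ set to 1 for illustration):

```r
# Gaussian distance-based kernel; rho is the kernel parameter
k_rho <- function(u, v, rho) exp(-(u - v)^2 / rho)

set.seed(1)
n  <- 8
xi <- rnorm(n)                       # feature i across the n samples
K  <- outer(xi, xi, k_rho, rho = 1)  # n x n Gram matrix
# K is symmetric with unit diagonal, since k_rho(u, u) = 1
```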

For each fixed value of $$\rho$$, $$S(\rho)$$ asymptotically follows a Normal distribution (Davies (1987)).

The data structure is shown in Figure 1 below.