
1 Background

Quality control (QC) is considered an essential step in metabolomics platforms for achieving high reproducibility and accuracy of data. Repeated injection of the same QC sample is increasingly accepted as a way to correct signal drift over the MS run order, and it is especially beneficial for improving data quality in multi-block experiments of large-scale metabolomic studies. statTarget is an easy-to-use tool that provides a graphical user interface for QC-based signal shift correction, integration of metabolomic data from multi-batch experiments, and comprehensive statistical analysis in non-targeted or targeted metabolomics. This document guides the user through using statTargetGUI to perform metabolomic data analysis. Note that it does not describe the inner workings of the statTarget algorithms.

1.1 System requirements

Requires R (>= 3.3.0).

1.2 Opening the GUI

Install the package with biocLite():

source("https://bioconductor.org/biocLite.R")
#> Bioconductor version 3.7 (BiocInstaller 1.30.0), ?biocLite for help
biocLite("statTarget")
#> BioC_mirror: https://bioconductor.org
#> Using Bioconductor 3.7 (BiocInstaller 1.30.0), R 3.5.0 (2018-04-23).
#> Installing package(s) 'statTarget'

On macOS, statTargetGUI requires X11 support (XQuartz). Download it from https://www.xquartz.org.

2 GUI overview

statTarget is an easy-to-use tool that provides a graphical user interface (Figure 1) for quality control based signal correction, integration of metabolomic data from multiple batches, and comprehensive statistical analysis for non-targeted and targeted approaches. (URL: https://github.com/13479776/statTarget)

2.1 What does statTarget offer statistically

The main GUI of statTarget has two basic sections. The first section is Shift Correction. It includes quality control-based robust LOESS signal correction (QC-RLSC), a widely accepted method for QC-based signal correction and integration of metabolomic data from multiple analytical batches (Dunn WB., et al. 2011; Luan H., et al. 2015). The second section is Statistical Analysis. It provides a comprehensive set of computational and statistical methods that are commonly applied to metabolomics data, and it produces multiple outputs for biomarker discovery.

Figure 1: statTargetGUI

Section 1 - Shift Correction provides the QC-RLSC algorithm, which fits the QC data; each metabolite in the true samples is then normalized to the QC samples. To avoid overfitting the observed data, LOESS-based generalised cross-validation (GCV) is applied automatically when QCspan is set to 0.
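The idea behind QC-based drift correction can be illustrated with a minimal base-R sketch for a single peak. This is not statTarget's implementation: the simulated drift, the QC spacing, the fixed span of 0.75, and the rescaling by the QC median are all assumptions made for illustration.

```r
# Minimal QC-based LOESS drift correction for one peak (illustrative only).
set.seed(1)
inj_order <- 1:41                            # injection order
is_qc <- inj_order %% 5 == 1                 # assumed: QC at injections 1, 6, ..., 41
drift <- 1 + 0.01 * inj_order                # simulated multiplicative drift
intensity <- 1000 * drift * exp(rnorm(41, sd = 0.02))

# Fit a LOESS curve to the QC intensities as a function of injection order.
qc <- data.frame(inj_order = inj_order[is_qc], intensity = intensity[is_qc])
fit <- loess(intensity ~ inj_order, data = qc, span = 0.75, degree = 2)

# Predict the drift for every injection and divide it out,
# then rescale to the median QC intensity (an assumed convention).
drift_hat <- predict(fit, newdata = data.frame(inj_order = inj_order))
corrected <- intensity / drift_hat * median(qc$intensity)

# Compare the relative standard deviation of the QCs before and after.
rsd <- function(v) 100 * sd(v) / mean(v)
rsd_before <- rsd(intensity[is_qc])
rsd_after <- rsd(corrected[is_qc])
```

The first and last injections are QCs here so that every sample lies inside the fitted range; by default, loess() does not extrapolate and would return NA outside it. Passing family = "symmetric" to loess() requests a robust fit.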

Section 2 - Statistical Analysis provides features including data preprocessing, data descriptions, multivariate statistical analysis, and univariate analysis.

Data preprocessing: 80-percent rule, glog transformation, KNN imputation, median imputation, and minimum-value imputation.

Data descriptions: mean, median, sum, quartiles, standard deviation, etc.

Multivariate statistical analysis: PCA, PLS-DA, VIP, random forest.

Univariate analysis: Welch's t-test, Shapiro-Wilk normality test, and Mann-Whitney test.

Biomarker analysis: ROC, odds ratio.
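The univariate tests listed above are all available in base R. The sketch below shows one plausible way to combine them for a single metabolite: check normality with Shapiro-Wilk, then use Welch's t-test for normally distributed data and the Mann-Whitney test otherwise. The helper name compare_groups and the decision rule are assumptions for illustration, not a description of statTarget's internal logic.

```r
# Choose between Welch's t-test and the Mann-Whitney test (illustrative helper).
compare_groups <- function(x, y, alpha = 0.05) {
  # Treat the data as normal only if both groups pass Shapiro-Wilk.
  normal <- shapiro.test(x)$p.value > alpha && shapiro.test(y)$p.value > alpha
  if (normal) {
    # t.test() uses the Welch correction by default (var.equal = FALSE).
    list(test = "Welch", p.value = t.test(x, y)$p.value)
  } else {
    # wilcox.test() on two samples is the Mann-Whitney test.
    list(test = "Mann-Whitney", p.value = wilcox.test(x, y)$p.value)
  }
}

# Two well-behaved groups built from normal quantiles.
x <- qnorm(ppoints(15), mean = 10)
y <- qnorm(ppoints(15), mean = 12)
res_normal <- compare_groups(x, y)

# A strongly skewed group with one extreme value.
z <- c(1, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 50)
res_skewed <- compare_groups(x, z)
```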

2.2 Running Shift Correction from the GUI

Pheno File

The meta information includes the sample name, class, batch, and order. Do not change the column names. (a) Class: QC samples should be labeled NA. (b) Order: the injection sequence. (c) Batch: the analysis block or batch as an ordinal number, e.g., 1, 2, 3, …. (d) Sample names must be consistent between the Pheno file and the Profile file. (See the example data.)

Profile File

The expression data include the sample names and the expression values. (See the example data.)
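A small base-R sketch can make the Pheno layout and the name-consistency requirement concrete. The column headers used here (sample, batch, class, order) and the sample names are assumptions for illustration; copy the exact headers from the example data shipped with statTarget.

```r
# Hypothetical Pheno table: QC samples carry NA in the class column.
pheno <- data.frame(
  sample = c("QC1", "A1", "A2", "QC2", "B1", "B2", "QC3"),
  batch  = rep(1, 7),                      # all injections in batch 1
  class  = c(NA, 1, 1, NA, 2, 2, NA),      # NA marks a QC sample
  order  = 1:7,                            # injection sequence
  stringsAsFactors = FALSE
)

# Sample names as they would appear in the Profile file (assumed here).
profile_samples <- c("QC1", "A1", "A2", "QC2", "B1", "B2", "QC3")

# Names listed in the Pheno table but absent from the Profile file.
missing_in_profile <- setdiff(pheno$sample, profile_samples)
n_qc <- sum(is.na(pheno$class))
```

An empty missing_in_profile means every Pheno sample has matching expression data.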

NA.Filter

Removes peaks with more than 80 percent missing values (NA or 0) in each group. (Default: 0.8)
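The filter can be sketched in base R as follows. The sketch assumes samples in rows and peaks in columns, and it reads "in each group" as "in every group"; both are assumptions, and statTarget's own rule may differ in detail.

```r
# Toy data: 10 samples (rows) in two groups, 3 peaks (columns).
x <- cbind(
  peak1 = c(10, 12, 11, 13, 12, 20, 21, 19, 22, 20),
  peak2 = c(NA, 0, NA, NA, 0, NA, NA, 0, NA, NA),
  peak3 = c(NA, NA, 0, NA, NA, 5, 6, 5, 7, 6)
)
group <- rep(c("A", "B"), each = 5)
threshold <- 0.8

# Fraction of missing values (NA or 0) in a vector.
missing_frac <- function(v) mean(is.na(v) | v == 0)

# Missing fraction per peak (rows) and per group (columns).
frac <- sapply(split(as.data.frame(x), group),
               function(g) sapply(g, missing_frac))

# Drop a peak only if it exceeds the threshold in every group.
drop <- apply(frac > threshold, 1, all)
filtered <- x[, !drop, drop = FALSE]
```

Here peak2 is removed because it is missing in both groups, while peak3 is kept because it is fully observed in group B.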

QCspan

The smoothing parameter, which controls the bias-variance tradeoff. QCspan values commonly range from 0.2 to 0.75. A span that is too small produces a large variance; a span that is too large produces a large bias. With the default value of 0, generalised cross-validation is performed to choose a suitable span and avoid overfitting the observed data. (Default: 0)
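Span selection by generalised cross-validation can be sketched with base R's loess(). The GCV score used here is the standard one, mean squared residual divided by (1 - trace(H)/n)^2, where trace(H) is the trace of the smoother matrix. The simulated data and the candidate grid are assumptions for illustration; statTarget's internal search may differ.

```r
# Simulated QC intensities with a smooth drift (illustrative data).
set.seed(2)
d <- data.frame(x = 1:40)
d$y <- 100 + 10 * sin(d$x / 8) + rnorm(40, sd = 1)

# GCV score for a given span.
gcv <- function(span, data) {
  fit <- loess(y ~ x, data = data, span = span, degree = 2)
  n <- nrow(data)
  mean(residuals(fit)^2) / (1 - fit$trace.hat / n)^2
}

# Evaluate the commonly used range of spans and keep the best one.
spans <- seq(0.2, 0.75, by = 0.05)
scores <- sapply(spans, gcv, data = d)
best_span <- spans[which.min(scores)]
```

The denominator penalises small spans, which use more effective parameters, so the minimum balances fit against smoothness.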

degree

Lets you specify local constant regression (i.e., the Nadaraya-Watson estimator, degree=0), local linear regression (degree=1), or local polynomial fits (degree=2). (Default: 2)

Imputation

The imputation method: nearest neighbor averaging, "KNN"; minimum values for imputed variables, "min"; or median values for imputed variables (group dependent), "median". (Default: KNN)
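The two simpler options can be sketched in base R for a single variable. The helper names are hypothetical, and the sketch assumes that "min" means the smallest observed value of the variable; KNN imputation is omitted because it needs an additional package.

```r
# Replace missing values with the smallest observed value of the variable.
impute_min <- function(v) {
  v[is.na(v)] <- min(v, na.rm = TRUE)
  v
}

# Replace missing values with the median of the sample's own group.
impute_group_median <- function(v, group) {
  med <- ave(v, group, FUN = function(g) median(g, na.rm = TRUE))
  v[is.na(v)] <- med[is.na(v)]
  v
}

v <- c(5, NA, 7, 10, NA, 14)
group <- c("A", "A", "A", "B", "B", "B")

by_min <- impute_min(v)                     # 5 5 7 10 5 14
by_median <- impute_group_median(v, group)  # 5 6 7 10 12 14
```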

2.3 Running Statistical Analysis from the GUI

Stat File

Expression data includes the sample name, group, and expression data.

NA.Filter

Removing peaks with more than 80 percent of missing values (NA or 0) in each group. (Default: 0.8)

Imputation

The imputation method: nearest neighbor averaging, "KNN"; minimum values for imputed variables, "min"; or median values for imputed variables (group dependent), "median". (Default: KNN)

Glog

Generalised logarithm (glog) transformation for variance stabilization. (Default: TRUE)
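One common form of the generalised logarithm is glog(x) = log(x + sqrt(x^2 + lambda)). The sketch below uses that form with a hypothetical lambda of 1; the exact variant and the choice of lambda in statTarget are not stated here, so treat both as assumptions.

```r
# Generalised logarithm: defined at zero and close to log(2 * x) for large x.
glog <- function(x, lambda = 1) {
  log(x + sqrt(x^2 + lambda))
}

at_zero <- glog(0)          # log(1) = 0, whereas log(0) is -Inf
at_large <- glog(1e6)       # very close to log(2e6)
transformed <- glog(c(0, 1, 10, 100, 1000))
```

Unlike a plain logarithm, the transformation stays finite for zero intensities, which is why it is often preferred for variance stabilization of peak tables.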

Scaling Method

The scaling method applied before statistical analysis (PCA or PLS). Pareto specifies Pareto scaling, Auto specifies auto scaling (unit variance scaling), Vast specifies vast scaling, and Range specifies range scaling. (Default: Pareto)
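The four scaling options follow widely used definitions, which the base-R sketch below implements for a matrix with samples in rows and variables in columns. The function name scale_columns is hypothetical, and the formulas are the textbook ones rather than code taken from statTarget.

```r
# Mean-center each column, then divide by a method-specific factor.
scale_columns <- function(x, method = c("pareto", "auto", "range", "vast")) {
  method <- match.arg(method)
  centered <- sweep(x, 2, colMeans(x))
  s <- apply(x, 2, sd)
  divisor <- switch(method,
    pareto = sqrt(s),                               # square root of the SD
    auto   = s,                                     # SD (unit variance)
    range  = apply(x, 2, max) - apply(x, 2, min),   # max minus min
    vast   = s^2 / colMeans(x)                      # SD times SD / mean
  )
  sweep(centered, 2, divisor, "/")
}

x <- cbind(a = c(1, 2, 3, 4, 5), b = c(10, 200, 30, 400, 50))
auto <- scale_columns(x, "auto")
pareto <- scale_columns(x, "pareto")
```

Auto scaling gives every variable unit variance, while Pareto scaling shrinks large variables less aggressively and so keeps more of the original data structure.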

M.U.Stat

Multivariate statistical analysis and univariate analysis. (Default: TRUE)

Permutation times

The number of random permutations for the PLS-DA model. (Default: 20)

PCs

PCs in the Xaxis or Yaxis: the principal components of the PCA or PLS model shown on the x-axis and y-axis. (Default: 1 and 2)

nvarRF

The number of variables shown in the Gini plot of the random forest model (<= 100). (Default: 20)

Labels

Show the sample names in the score plot. (Default: TRUE)

Multiple testing

Multiple testing correction via false discovery rate (FDR) estimation with the Benjamini-Hochberg method. The FDR conceptualizes the rate of type I errors in null hypothesis testing when conducting multiple comparisons. (Default: TRUE)
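In base R, the Benjamini-Hochberg adjustment is available through p.adjust(). The p-values below are made up for illustration.

```r
# Raw p-values for five hypothetical metabolites.
p <- c(0.001, 0.008, 0.039, 0.041, 0.2)

# Benjamini-Hochberg adjusted p-values.
q <- p.adjust(p, method = "BH")

# Number of metabolites called significant at the 0.05 level.
n_raw <- sum(p < 0.05)   # 4 without correction
n_fdr <- sum(q < 0.05)   # 2 after correction
```

The third and fourth metabolites pass the raw 0.05 cutoff but not the adjusted one, which shows how the correction guards against false positives when many peaks are tested.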

Volcano FC

The fold-change cutoff values used to mark up- or down-regulated metabolites in the volcano plot. (Default: > 2 or < 1.5)

Volcano Pvalue

The significance level for metabolites in the volcano plot. (Default: 0.05)

3 Investigating the results

Download the statTarget tutorial and example data.

Once the data files have been analysed, it is time to investigate the results. The tutorial and example data are available through the GitHub page. (URL: https://github.com/13479776/statTarget)

3.1 Results of Shift Correction (ShiftCor)

  • The output file:
statTarget -- shiftCor 
-- After_shiftCor # The corrected results including the loplot using statTarget
-- Before_shiftCor # The raw results using statTarget
-- RSDresult # The RSD analysis 
  • The Figures:

Loplot (left): a visualization of the QC-RLSC correction for each peak.

The RSD distribution (right): the relative standard deviation of the peaks in the samples and QCs.
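The relative standard deviation (RSD, also called CV) and its cumulative distribution can be computed in a few lines of base R. The QC matrix and the thresholds below are made up for illustration, with QC injections in rows and peaks in columns.

```r
# RSD in percent: standard deviation relative to the mean.
rsd <- function(v) 100 * sd(v) / mean(v)

# Hypothetical QC intensities for three peaks over four QC injections.
qc <- cbind(
  p1 = c(100, 101, 99, 100),
  p2 = c(100, 120, 80, 100),
  p3 = c(100, 160, 40, 100)
)
peak_rsd <- apply(qc, 2, rsd)

# Percentage of peaks whose RSD falls below each threshold.
thresholds <- c(5, 20, 50)
cum_pct <- sapply(thresholds, function(t) 100 * mean(peak_rsd < t))
names(cum_pct) <- paste0("CV<", thresholds, "%")
```

A successful correction shifts this distribution upward, meaning that a larger share of peaks falls below each threshold after correction than before.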

  • The status log (Example data):
#############################
# Shift Correction function #
#############################

Data File Checking Start..., Time:  Thu Jan  5 18:58:09 2017 

217 Pheno Samples vs 218 Profile samples

The Pheno samples list (*NA, missing data from the Profile File)
  [1] "QC1"              "QC2"              "QC3"              "QC4"             
  [5] "QC5"              "A1"               "A2"               "A3"              
  [9] "A4"               "A5"               "A6"               "A7"              
 [13] "A8"               "A9"               "A10"              "QC6"             
 [17] "A11"              "A12"              "A13"              "A14"             
 [21] "A15"              "B16"              "B17"              "B18"             
 [25] "B19"              "B20"              "QC7"              "B21"             
 [29] "B22"              "B23"              "B24"              "B25"             
 [33] "B26"              "B27"              "B28"              "B29"             
 [37] "B30"              "QC8"              "C31"              "C32"             
 [41] "C33"              "C34"              "C35"              "QC9"             
 [45] "QC10"             "QC11"             "QC12"             "QC13"            
 [49] "C36_120918171155" "C37"              "C38"              "C39"             
 [53] "C40"              "QC14"             "C41"              "C42"             
 [57] "C43"              "C44"              "C45"              "D46"             
 [61] "D47"              "D48"              "D49"              "D50"             
 [65] "QC15"             "D51"              "D52"              "D53"             
 [69] "D54"              "D55"              "D56"              "D57"             
 [73] "D58"              "D59"              "D60"              "QC16"            
 [77] "E61"              "E62"              "E63"              "E64"             
 [81] "E65"              "E66"              "E67"              "E68"             
 [85] "E69"              "E70"              "QC17"             "E71"             
 [89] "E72"              "E73"              "E74"              "E75"             
 [93] "F76"              "F77"              "F78"              "F79"             
 [97] "F80"              "QC18"             "F81"              "F82"             
[101] "F83"              "F84"              "F85"              "F86"             
[105] "F87"              "F88"              "F89"              "F90"             
[109] "QC19"             "QC20"             "QC21"             "QC22"            
[113] "QC23"             "QC24"             "a1"               "a2"              
[117] "a3"               "a4"               "a5"               "a6"              
[121] "a7"               "a8"               "a9"               "a10"             
[125] "QC25"             "a11"              "a12"              "a13"             
[129] "a14"              "a15"              "b16"              "b18"             
[133] "b19"              "b20"              "QC26"             "b21"             
[137] "b22"              "b23"              "b24"              "b25"             
[141] "b26"              "b27"              "b28"              "b29"             
[145] "b30"              "QC27"             "c31"              "c32"             
[149] "c33"              "c34"              "c35"              "QC28"            
[153] "QC29"             "QC31"             "QC32"             "c36"             
[157] "c37"              "c38"              "c39"              "c40"             
[161] "QC33"             "c41"              "c42"              "c43"             
[165] "c44"              "c45"              "d46"              "d47"             
[169] "d48"              "d49"              "d50"              "QC34"            
[173] "d51"              "d52"              "d53"              "d54"             
[177] "d55"              "d56"              "d57"              "d58"             
[181] "d59"              "d60"              "QC35"             "e61"             
[185] "e62"              "e63"              "e64"              "e65"             
[189] "e66"              "e67"              "e68"              "e69"             
[193] "e70"              "QC36"             "e71"              "e72"             
[197] "e73"              "e74"              "e75"              "f76"             
[201] "f77"              "f78"              "f79"              "f80"             
[205] "QC37"             "f81"              "f82"              "f83"             
[209] "f84"              "f85"              "f86"              "f87"             
[213] "f88"              "f89_120921102721" "f90"              "QC38"            
[217] "QC39"            

Warning: The sample size in Profile File is larger than Pheno File! 

Pheno information:
  Class No.
1     1  30
2     2  29
3     3  30
4     4  30
5     5  30
6     6  30
7    QC  38
  Batch No.
1     1 108
2     2 109

Profile information:
                No.
QC and samples  218
Metabolites    1312

statTarget: shiftCor start...Time:  Thu Jan  5 18:58:11 2017 

Step 1: Evaluation of missing value...

The number of NA value in Data Profile before QC-RLSC: 2280

The number of variables including 80 % of missing value : 3

Step 2: Imputation start...

The number of NA value in Data Profile after the initial imputation: 0

Imputation Finished!

Step 3: QC-RLSC Start... Time:  Thu Jan  5 18:58:12 2017

Warning: The QCspan was set at '0'.

The GCV was used to avoid overfitting the observed data

  |===============================================================================| 100%

High-resolution images output...

Calculation of CV distribution of raw peaks (QC)...

            CV<5%    CV<10%   CV<15%   CV<20%   CV<25%   CV<30%   CV<35%   CV<40%
Batch_1 0.6875477  7.944996 23.98778 37.58594 46.98243 54.39267 61.19175 67.99083
Batch_2 4.0488923 25.821238 45.76012 57.44843 64.40031 70.51184 76.39419 80.29030
Total   0.3819710  6.722689 21.08480 33.38426 44.38503 51.87166 59.20550 64.55309
          CV<45%   CV<50%   CV<55%   CV<60%   CV<65%   CV<70%   CV<75%   CV<80%   CV<85%
Batch_1 72.80367 77.92208 80.97785 84.11001 87.16578 88.69366 89.45760 90.67991 91.59664
Batch_2 83.34607 86.40183 88.31169 90.52712 92.58976 93.43010 94.42322 95.64553 96.18029
Total   69.36593 74.56073 78.53323 81.51261 82.96409 85.10313 87.39496 89.53400 91.36746
          CV<90%   CV<95%  CV<100%
Batch_1 92.66616 93.35371 94.57601
Batch_2 96.48587 97.17341 97.40260
Total   92.89534 94.27044 94.95798


Calculation of CV distribution of corrected peaks (QC)...

           CV<5%   CV<10%   CV<15%   CV<20%   CV<25%   CV<30%   CV<35%   CV<40%   CV<45%
Batch_1 18.25821 45.98930 64.40031 72.72727 78.45684 83.72804 86.17265 88.54087 89.76318
Batch_2 20.24446 51.48969 68.06723 78.22765 84.56837 88.23529 90.75630 92.36058 93.50649
Total   15.73720 44.46142 64.62949 73.18564 80.36669 84.79756 87.31856 88.69366 89.68678
          CV<50%   CV<55%   CV<60%   CV<65%   CV<70%   CV<75%   CV<80%   CV<85%   CV<90%
Batch_1 91.06188 91.90222 92.58976 93.04813 93.43010 94.04125 94.65241 95.11077 95.56914
Batch_2 94.11765 94.88159 95.49274 96.18029 96.63866 96.86784 97.09702 97.40260 97.70817
Total   90.75630 91.97861 93.20092 93.96486 94.57601 95.33995 95.87471 96.10390 96.63866
          CV<95%  CV<100%
Batch_1 95.95111 96.02750
Batch_2 98.09015 98.31933
Total   96.71505 97.09702


Correction Finished! Time:  Thu Jan  5 19:00:51 2017

3.2 Results of statistical analysis (statAnalysis)

  • The output file:
statTarget -- statAnalysis 
-- PCA_Data_Pareto # Principal Component Analysis
-- PLS_DA_Pareto # Partial least squares Discriminant Analysis
-- Univariate # Univariate analysis
   ----- BoxPlot
   ----- Fold_Changes
   ----- Mann-Whitney_Tests # For non-normally distributed variables
   ----- oddratio # Odds ratio
   ----- Pvalues # Integration of p-values from Welch_test and MWT_test
   ----- RForest # Random Forest
   ----- ROC # receiver operating characteristic curve
   ----- Shapiro_Tests 
   ----- Significant_Variables # The Peaks with P-value < 0.05 
   ----- Volcano_Plots
   ----- WelchTest  # For normally distributed variables
  • The Figures: