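In bioinformatics analyses, and in transcriptome analyses in particular, you often find that the data you sequenced are not enough and that you need to bring in existing data from public databases. Can those data simply be pooled with your own sequencing data? If you pool them carelessly, batch effects will be present. Which R packages can handle batch effects?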
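ComBat is an empirical Bayes method that uses known batch labels to correct batch effects in high-throughput data. It is provided by the sva R package. ComBat offers two variants, a parametric one and a non-parametric one, selected through the par.prior argument of the ComBat function (TRUE for parametric, FALSE for non-parametric). The input is a normalized data matrix with features in rows and samples in columns, and the return value is a batch-corrected matrix of the same shape.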
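# biocLite() has been retired; current Bioconductor releases install through BiocManager
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install(c("sva", "bladderbatch"))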
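library(sva)
library(bladderbatch)
data(bladderdata)
# Sample annotation (including the batch column) and the expression matrix
pheno = pData(bladderEset)
edata = exprs(bladderEset)
batch = pheno$batch
# Intercept-only design: no covariates are protected during the adjustment
modcombat = model.matrix(~1, data=pheno)
# par.prior=TRUE selects the parametric empirical Bayes adjustment
combat_edata = ComBat(dat=edata, batch=batch, mod=modcombat, par.prior=TRUE, prior.plots=FALSE)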
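One way to check whether a correction such as ComBat had any effect is to measure how strongly the first principal component still tracks batch. The sketch below is a minimal base-R illustration on simulated data; `batch_r2` is a hypothetical helper introduced here (it is not part of sva), and the simulated shift is only an assumption about what a batch effect might look like.

```r
# Hypothetical helper: R^2 of a one-way fit of PC1 scores on batch.
# mat has features in rows and samples in columns, like the ComBat input.
batch_r2 <- function(mat, batch) {
  pc1 <- prcomp(t(mat), center = TRUE, scale. = FALSE)$x[, 1]
  summary(lm(pc1 ~ factor(batch)))$r.squared
}

set.seed(1)
n_genes <- 200
batch <- rep(c(1, 2), each = 10)

# Simulated expression without any batch effect: pure noise
clean <- matrix(rnorm(n_genes * length(batch)), nrow = n_genes)

# Same data with a gene-specific shift added to every sample of batch 2
shifted <- clean
shifted[, batch == 2] <- shifted[, batch == 2] + rnorm(n_genes, mean = 2)

r2_clean   <- batch_r2(clean, batch)
r2_shifted <- batch_r2(shifted, batch)
cat("R^2 without batch shift:", r2_clean, "\n")
cat("R^2 with batch shift:   ", r2_shifted, "\n")
```

With the objects from the code above, `batch_r2(edata, batch)` and `batch_r2(combat_edata, batch)` could be compared in the same way. A lower value after correction would suggest less batch-driven structure on the first component, although this is only one diagnostic among several.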
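The ber R package can also correct batch effects. The example below uses the same bladder cancer data as above. The package provides six functions for batch correction: ber, ber_bg, combat_np, combat_p, mean_centering and standardization. For all six, the rows of the input correspond to samples and the columns to variables.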
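library(ber)
# ber expects samples in rows, so transpose the genes-by-samples matrix
dat = t(edata)
# Covariate of interest, passed as a data frame
class = data.frame(pheno$cancer)
batch = as.factor(pheno$batch)
ber_edata1 = ber(dat, batch, class)
ber_edata2 = ber_bg(dat, batch, class)
ber_edata3 = combat_np(dat, batch, class)
ber_edata4 = combat_p(dat, batch, class)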
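To make the idea behind the two simplest approaches named above (mean centering and standardization) more concrete, here is a base-R sketch under the same samples-in-rows layout. `centre_by_batch` and its `unit_sd` argument are hypothetical names introduced for illustration; this is not the ber implementation, whose details may differ.

```r
# Hypothetical helper, not taken from ber: centre every variable within each
# batch and, if unit_sd = TRUE, also scale it to unit standard deviation.
# Y has samples in rows and variables in columns.
centre_by_batch <- function(Y, batch, unit_sd = FALSE) {
  batch <- as.factor(batch)
  out <- Y
  for (b in levels(batch)) {
    idx <- batch == b
    out[idx, ] <- scale(Y[idx, , drop = FALSE], center = TRUE, scale = unit_sd)
  }
  out
}

set.seed(42)
batch <- factor(rep(c("A", "B", "C"), each = 5))
# Simulated data: 15 samples, 4 variables, with a batch-specific offset
Y <- matrix(rnorm(15 * 4), nrow = 15) + as.integer(batch) * 3

centred      <- centre_by_batch(Y, batch)
standardised <- centre_by_batch(Y, batch, unit_sd = TRUE)

# Per-batch means of each variable after centring (all numerically zero)
batch_means <- apply(centred, 2, function(v) tapply(v, batch, mean))
# Per-batch standard deviations after standardisation (all one)
batch_sds <- apply(standardised, 2, function(v) tapply(v, batch, sd))
print(round(batch_means, 10))
print(round(batch_sds, 10))
```

Both variants remove only location (and scale) differences between batches and ignore covariates, which is the main limitation compared with the model-based functions called above.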
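A paper published in 2010 actually covered this question. Statistically, it comes down to how to balance within-group and between-group effects. The sva package in R can handle this directly. Beyond that, here is an online resource worth recommending: http://www.itl.nist.gov/div898/handbook/eda/section4/eda42a3.htm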
Batch is a Nuisance Factor | The two nuisance factors in this experiment are the batch number and the lab. There are two batches and eight labs. Ideally, these factors will have minimal effect on the response variable. We will investigate the batch factor first.
Bihistogram | This bihistogram shows the following.
Although we could stop with the bihistogram, we will show a few other commonly used two-sample graphical techniques for comparison.
Quantile-Quantile Plot | This q-q plot shows the following.
Box Plot | This box plot shows the following.
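The two-sample graphics named above (Q-Q plot and box plot) can be produced with base R. The sketch below uses simulated stand-in data, not the original measurements: the two batches are drawn from normal distributions whose means and standard deviations are the batch summary statistics quoted in this text, which is an assumption made purely for illustration.

```r
set.seed(123)
# Simulated stand-in data based on the quoted batch means and standard deviations
batch1 <- rnorm(240, mean = 688.9987, sd = 65.5491)
batch2 <- rnorm(240, mean = 611.1559, sd = 61.8543)
strength <- data.frame(
  y     = c(batch1, batch2),
  batch = factor(rep(1:2, each = 240))
)

pdf(NULL)  # discard graphics output; remove this line to see the plots
op <- par(mfrow = c(1, 2))

# Q-Q plot: points on the reference line would mean identical distributions
qqplot(batch1, batch2, xlab = "Batch 1", ylab = "Batch 2",
       main = "Quantile-Quantile Plot")
abline(0, 1, lty = 2)

# Box plot: compares location and spread of the two batches side by side
bp <- boxplot(y ~ batch, data = strength, xlab = "Batch",
              ylab = "Response", main = "Box Plot")

par(op)
invisible(dev.off())
```

On the Q-Q plot, a cloud of points running parallel to the dashed line but offset from it points to a shift in location between the batches, while a different slope points to a difference in spread.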
Block Plots | A block plot is generated for each of the eight labs, with "1" and "2" denoting the batch numbers. In the first plot, we do not include any of the primary factors. The next 3 block plots include one of the primary factors. Note that each of the 3 primary factors (table speed = X1, down feed rate = X2, wheel grit size = X3) has 2 levels. With 8 labs and 2 levels for the primary factor, we would expect 16 separate blocks on these plots. The fact that some of these blocks are missing indicates that some of the combinations of lab and primary factor are empty. These block plots show the following.
Quantitative Techniques | We can confirm some of the conclusions drawn from the above graphics by using quantitative techniques. The F-test can be used to test whether or not the variances from the two batches are equal, and the two-sample t-test can be used to test whether or not the means from the two batches are equal. Summary statistics for each batch are shown below.

Batch | Number of observations | Mean | Standard deviation | Variance
1 | 240 | 688.9987 | 65.5491 | 4296.6845
2 | 240 | 611.1559 | 61.8543 | 3825.9544
F-Test | The two-sided F-test indicates that the variances for the two batches are not significantly different at the 5 % level.
$H_0$: $\sigma_1^2 = \sigma_2^2$
$H_a$: $\sigma_1^2 \neq \sigma_2^2$
Test statistic: $F = 1.123$
Numerator degrees of freedom: $\nu_1 = 239$
Denominator degrees of freedom: $\nu_2 = 239$
Significance level: $\alpha = 0.05$
Critical values: $F_{1-\alpha/2,\nu_1,\nu_2} = 0.845$, $F_{\alpha/2,\nu_1,\nu_2} = 1.289$
Critical region: Reject $H_0$ if $F < 0.845$ or $F > 1.289$
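The F statistic above can be reproduced from the quoted batch variances alone. The following base-R sketch does that and looks up critical values with `qf`; the variable names are chosen here for illustration, and the lower-tail quantile convention of `qf` is an assumption that may not match the notation used in the text.

```r
# Summary statistics quoted in the text
var1 <- 4296.6845; n1 <- 240   # batch 1
var2 <- 3825.9544; n2 <- 240   # batch 2
alpha <- 0.05

# Ratio of the two sample variances
f_stat <- var1 / var2

# Two-sided critical values from the F distribution (lower-tail quantiles)
f_lower <- qf(alpha / 2,     df1 = n1 - 1, df2 = n2 - 1)
f_upper <- qf(1 - alpha / 2, df1 = n1 - 1, df2 = n2 - 1)

# Reject equality of variances only if F falls outside the two critical values
reject_h0 <- f_stat < f_lower || f_stat > f_upper
cat("F =", round(f_stat, 3), " reject H0:", reject_h0, "\n")
```

Because both samples have the same degrees of freedom, the lower critical value computed this way is the reciprocal of the upper one, so it may not match the 0.845 quoted above. The decision is the same either way, since F = 1.123 lies between the critical values.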
Two Sample t-Test | Since the F-test does not reject equality of the two batch variances, we can pool the variances for the two-sided, two-sample t-test to compare batch means.
$H_0$: $\mu_1 = \mu_2$
$H_a$: $\mu_1 \neq \mu_2$
Test statistic: $T = 13.3806$
Pooled standard deviation: $s_p = 63.7285$
Degrees of freedom: $\nu = 478$
Significance level: $\alpha = 0.05$
Critical value: $t_{1-\alpha/2,\nu} = 1.965$
Critical region: Reject $H_0$ if $|T| > 1.965$
The t-test indicates that the mean for batch 1 is larger than the mean for batch 2 at the 5 % significance level.
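The pooled standard deviation and the t statistic can likewise be recomputed from the quoted summary statistics. This is a minimal base-R sketch of the standard equal-variance two-sample formula; the variable names are illustrative.

```r
# Summary statistics quoted in the text
mean1 <- 688.9987; var1 <- 4296.6845; n1 <- 240   # batch 1
mean2 <- 611.1559; var2 <- 3825.9544; n2 <- 240   # batch 2
alpha <- 0.05

# Pooled standard deviation, weighting each variance by its degrees of freedom
df <- n1 + n2 - 2
s_pooled <- sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / df)

# Equal-variance two-sample t statistic
t_stat <- (mean1 - mean2) / (s_pooled * sqrt(1 / n1 + 1 / n2))

# Two-sided critical value and decision
t_crit <- qt(1 - alpha / 2, df = df)
reject_h0 <- abs(t_stat) > t_crit
cat("T =", round(t_stat, 4), " s_p =", round(s_pooled, 4),
    " critical value =", round(t_crit, 3), "\n")
```

When the raw observations are available, `t.test(x, y, var.equal = TRUE)` performs the same pooled test directly.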
Conclusions | We can draw the following conclusions from the above analysis.
This batch effect was completely unexpected by the scientific investigators in this study. Note that although the quantitative techniques support the conclusions of unequal means and equal standard deviations, they do not show the more subtle features of the data such as the presence of outliers and the skewness of the batch 2 data.