选择性剪切分析之original统计方法篇

· Hypergeometric 超几何检验，是适应于类似单细胞数据中的高纬度数据的相似性相关性检验，更多的找出高表达与不表达基因之间的差异。亦可以用于ratio的值。

· Kruskal-Wallis rank sum test 是大于三组的时候使用的秩和检验，类似参数检验中的anova，同样其中的pvalue描述的样本之间整体的差异显著性。

· Wald test：判断两个regression模型的相似性（其中的相似性利用regression function的系数beta比较）具体来说在很多生物信息分析二代测序差异基因（DESeq2、limma）的时候就用的不同design得到的regression function的用wald test检验后得到的结果。

In mathematical statistics, the Fisher information(sometimes simply called information) is a way of measuring the amount of information that an observable random variable X carries about an unknown parameter θ of a distribution that models X. Formally, it is the variance of the score, or the expected value of the observed information（参数分布分数的方差，或者是观察信息的期望值）. In Bayesian statistics, the asymptotic distribution of the posterior mode（后验模型的渐进分布取决于费舍尔信息值） depends on the Fisher information and not on the prior（而不是先验分布） (according to the Bernstein–von Mises theorem, which was anticipated by Laplace for exponential families（拉普拉斯指数家族簇）). The role of the Fisher information in the asymptotic theory of maximum-likelihood estimation was emphasized by the statistician Ronald Fisher (following some initial results by Francis Ysidro Edgeworth). The Fisher information is also used in the calculation of the Jeffreys prior, which is used in Bayesian statistics（fisher理论也不仅仅只用在最大似然参数的渐进估计中，同时还在贝叶斯估计中计算先验值）。

$I( heta) = E[S(X; heta)^2]-E[S(X; heta)]^2 = Var[S(X; heta)]$
于是得到了Fisher Information的第一条数学意义：就是用来估计MLE的方程的方差。它的直观表述就是，随着收集的数据越来越多，这个方差由于是一个Independent sum的形式，也就变的越来越大，也就象征着得到的信息越来越多。

在DSU方法中最核心的是fit一个向量线性方程（VGAM、MGLM packages）得到选择性剪切事件的PSI值，类似的有bioconductor上的DRIMseq以及JunctonSeq。而我们在R流程包中使用的就是betabinomial （aod）以及dirichlet regression function（MGLM）拟合。关键之处在于利用了confounding factor 作为协变量（such as cell type 、library type、size factor的差异）加入到多变量的线性回归方程中。使得拟合结果不仅仅收敛于输入值的概率值同时约束于协变量的水平个数甚至维度。

Chisquare test obtain pvalue。以及cell specific事件的定义时，用的Wilcox test的pvalue以及delta PSI、PSI mean筛选。