  • R: Common Statistical Tests

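    A statistical test compares a sample result against its sampling distribution in order to reach a decision. It has five main steps:

    1. State the hypotheses
    2. Derive the sampling distribution
    3. Choose the significance level and the rejection region
    4. Compute the test statistic
    5. Make the decision (source: Baidu Baike)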
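
    The five steps can be traced by hand in R. This is a minimal sketch, assuming a two-sided one-sample t-test of H0: mu = 0 on group 1 of the built-in sleep data set; the variable names (mu0, t_crit, t_stat and so on) are illustrative and not part of the original text.

```r
# Step 1: H0: mu = 0 versus H1: mu != 0, for the extra sleep in group 1
# of the built-in `sleep` data set.
x   <- sleep$extra[sleep$group == 1]
n   <- length(x)
mu0 <- 0

# Step 2: under H0, (mean(x) - mu0) / (sd(x) / sqrt(n)) follows a
# t distribution with n - 1 degrees of freedom.
df <- n - 1

# Step 3: significance level and two-sided rejection region |t| > t_crit.
alpha  <- 0.05
t_crit <- qt(1 - alpha / 2, df)

# Step 4: test statistic.
t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))

# Step 5: decision, plus the equivalent two-sided p-value.
reject  <- abs(t_stat) > t_crit
p_value <- 2 * pt(abs(t_stat), df, lower.tail = FALSE)
c(t = t_stat, t_crit = t_crit, p = p_value)
```

    Comparing |t| with the critical value and comparing the p-value with alpha always lead to the same decision; t.test(x, mu = 0) should report the same statistic and p-value.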

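    Hypothesis testing, also called significance testing, is another major part of statistical inference; its purpose is to decide whether population parameters differ. In essence, a hypothesis test asks whether an observed "difference" is due to sampling error or to a real difference between populations, that is, how strong the evidence is that two treatments have different effects. The strength of that evidence is measured and expressed by the probability P. Besides the t distribution, other test statistics and distributions are used for other kinds of data, such as the F distribution and the χ² distribution. The testing steps are the same in every case; only the test statistic that has to be computed differs.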
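
    The role of sampling error can be illustrated by simulation. The sketch below assumes two samples of size 20 drawn from the same standard normal population, so every "difference" it detects is a false alarm; the seed, sample size and number of replications are arbitrary choices made for illustration.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Draw 2000 pairs of samples from the SAME normal population, so that
# any observed difference in means is pure sampling error.
p_values <- replicate(2000, t.test(rnorm(20), rnorm(20))$p.value)

# Share of tests that reject H0 at the 0.05 level; it should be close to 0.05.
mean(p_values < 0.05)
```

    When H0 is true, roughly 5% of the tests still reject at the 0.05 level, which is what the significance level is meant to control.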

    Hypothesis tests for the mean of a normal population

    t-test

    t.test() => Student's t-Test

    require(graphics)
    
    t.test(1:10, y = c(7:20))      # P = .00001855
    t.test(1:10, y = c(7:20, 200)) # P = .1245    -- no longer significant
    
    
    ## Classic example: Student's sleep data
    plot(extra ~ group, data = sleep)
    
    

    ## Traditional interface
    with(sleep, t.test(extra[group == 1], extra[group == 2]))
    
    	Welch Two Sample t-test
    
    data:  extra[group == 1] and extra[group == 2]
    t = -1.8608, df = 17.776, p-value = 0.07939
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
     -3.3654832  0.2054832
    sample estimates:
    mean of x mean of y 
         0.75      2.33 
    
    ## Formula interface
    t.test(extra ~ group, data = sleep)
    
    	Welch Two Sample t-test
    
    data:  extra by group
    t = -1.8608, df = 17.776, p-value = 0.07939
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
     -3.3654832  0.2054832
    sample estimates:
    mean in group 1 mean in group 2 
               0.75            2.33 
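
    The fractional degrees of freedom (17.776) come from the Welch-Satterthwaite approximation, which t.test() uses by default because it does not assume equal variances. A minimal sketch that recomputes the statistic, the degrees of freedom and the p-value by hand (the variable names are illustrative):

```r
x <- sleep$extra[sleep$group == 1]
y <- sleep$extra[sleep$group == 2]

# Squared standard errors of the two sample means.
se2_x <- var(x) / length(x)
se2_y <- var(y) / length(y)

# Welch t statistic and Welch-Satterthwaite approximate degrees of freedom.
t_stat <- (mean(x) - mean(y)) / sqrt(se2_x + se2_y)
df <- (se2_x + se2_y)^2 /
  (se2_x^2 / (length(x) - 1) + se2_y^2 / (length(y) - 1))

# Two-sided p-value from the t distribution with fractional df.
p_value <- 2 * pt(abs(t_stat), df, lower.tail = FALSE)
round(c(t = t_stat, df = df, p = p_value), 5)
```

    The three values should agree with the t, df and p-value lines of the output above.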
    
    

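    One population

    • The lifetime X (in hours) of a certain component follows a normal distribution N(mu, sigma^2), where mu and sigma^2 are both unknown. The lifetimes of 16 components are listed below. Is there reason to believe that the mean lifetime exceeds 225 hours?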
    X<-c(159, 280, 101, 212, 224, 379, 179, 264,
    222, 362, 168, 250, 149, 260, 485, 170)
    t.test(X, alternative = "greater", mu = 225)
    
    	One Sample t-test
    
    data:  X
    t = 0.66852, df = 15, p-value = 0.257
    alternative hypothesis: true mean is greater than 225
    95 percent confidence interval:
     198.2321      Inf
    sample estimates:
    mean of x 
        241.5 
    
    

    Two populations

    • X is the yield of the old steel furnace and Y is the yield of the new one. Does the new procedure increase the yield?
    X<-c(78.1,72.4,76.2,74.3,77.4,78.4,76.0,75.5,76.7,77.3)
    Y<-c(79.1,81.0,77.3,79.1,80.0,79.1,79.1,77.3,80.2,82.1)
    t.test(X, Y, var.equal=TRUE, alternative = "less")
    
    	Two Sample t-test
    
    data:  X and Y
    t = -4.2957, df = 18, p-value = 0.0002176
    alternative hypothesis: true difference in means is less than 0
    95 percent confidence interval:
          -Inf -1.908255
    sample estimates:
    mean of x mean of y 
        76.23     79.43 
    

    Paired t-test

    • Run a paired t-test on the furnace data, treating the i-th values of X and Y as a pair
    X<-c(78.1,72.4,76.2,74.3,77.4,78.4,76.0,75.5,76.7,77.3)
    Y<-c(79.1,81.0,77.3,79.1,80.0,79.1,79.1,77.3,80.2,82.1)
    t.test(X-Y, alternative = "less")
    
    	One Sample t-test
    
    data:  X - Y
    t = -4.2018, df = 9, p-value = 0.00115
    alternative hypothesis: true mean is less than 0
    95 percent confidence interval:
          -Inf -1.803943
    sample estimates:
    mean of x 
         -3.2 
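
    Testing the differences X - Y against zero is the same computation as a paired test. A short sketch, using the paired argument of t.test(), that checks the two calls agree:

```r
X <- c(78.1, 72.4, 76.2, 74.3, 77.4, 78.4, 76.0, 75.5, 76.7, 77.3)
Y <- c(79.1, 81.0, 77.3, 79.1, 80.0, 79.1, 79.1, 77.3, 80.2, 82.1)

# One-sample test on the pairwise differences (the form used above).
by_difference <- t.test(X - Y, alternative = "less")

# The same test, requested through the paired argument.
by_paired_arg <- t.test(X, Y, paired = TRUE, alternative = "less")

c(by_difference$p.value, by_paired_arg$p.value)
```

    The paired form keeps both vectors visible in the call, which makes the intent easier to read.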
    

    Hypothesis tests for the variance of a normal population

    var.test() => F Test to Compare Two Variances

    x <- rnorm(50, mean = 0, sd = 2)
    y <- rnorm(30, mean = 1, sd = 1)
    var.test(x, y)                  # Do x and y have the same variance?
    var.test(lm(x ~ 1), lm(y ~ 1))  # The same test, via linear models.
    
    
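    • Twenty boys are sampled from fifth-grade pupils and their heights (in cm) are measured, as listed below. At the 0.05 significance level, is the mean equal to 149, and is sigma^2 equal to 75?
    X<-c(136, 144, 143, 157, 137, 159, 135, 158, 147, 165,
         158, 142, 159, 150, 156, 152, 140, 149, 148, 155)
    # Caution: Y is still the 10-value furnace vector defined earlier, so this call
    # compares var(X) with var(Y); it does not test mean = 149 or sigma^2 = 75.
    var.test(X,Y)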
    
    	F test to compare two variances
    
    data:  X and Y
    F = 34.945, num df = 19, denom df = 9, p-value = 6.721e-06
    alternative hypothesis: true ratio of variances is not equal to 1
    95 percent confidence interval:
       9.487287 100.643093
    sample estimates:
    ratio of variances 
              34.94489 
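
    The question about the heights concerns a single sample, so it calls for a one-sample t-test of the mean and a chi-squared test of the variance. Base R has no ready-made function for the latter, so the sketch below computes it directly, assuming normal data and a two-sided alternative; sigma2_0, chi2 and p_var are illustrative names.

```r
X <- c(136, 144, 143, 157, 137, 159, 135, 158, 147, 165,
       158, 142, 159, 150, 156, 152, 140, 149, 148, 155)

# Mean: two-sided one-sample t-test of H0: mu = 149.
mean_test <- t.test(X, mu = 149)

# Variance: under H0: sigma^2 = 75 and normality,
# (n - 1) * var(X) / 75 follows a chi-squared distribution with n - 1 df.
sigma2_0 <- 75
df   <- length(X) - 1
chi2 <- df * var(X) / sigma2_0

# Two-sided p-value: twice the smaller tail probability.
p_lower <- pchisq(chi2, df)
p_var   <- 2 * min(p_lower, 1 - p_lower)

c(p_mean = mean_test$p.value, chi2 = chi2, p_variance = p_var)
```

    Here the sample mean is 149.5 and the chi-squared statistic is about 19.69 on 19 degrees of freedom, so neither null hypothesis is rejected at the 0.05 level.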
    
    
    • Analyze the furnace data
    X<-c(78.1,72.4,76.2,74.3,77.4,78.4,76.0,75.5,76.7,77.3)
    Y<-c(79.1,81.0,77.3,79.1,80.0,79.1,79.1,77.3,80.2,82.1)
    var.test(X,Y)
    
    	F test to compare two variances
    
    data:  X and Y
    F = 1.4945, num df = 9, denom df = 9, p-value = 0.559
    alternative hypothesis: true ratio of variances is not equal to 1
    95 percent confidence interval:
     0.3712079 6.0167710
    sample estimates:
    ratio of variances 
              1.494481 
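
    The F statistic is simply the ratio of the two sample variances. A minimal sketch that reproduces the statistic and the two-sided p-value by hand (variable names are illustrative):

```r
X <- c(78.1, 72.4, 76.2, 74.3, 77.4, 78.4, 76.0, 75.5, 76.7, 77.3)
Y <- c(79.1, 81.0, 77.3, 79.1, 80.0, 79.1, 79.1, 77.3, 80.2, 82.1)

# Ratio of sample variances and its degrees of freedom.
f_stat <- var(X) / var(Y)
df1 <- length(X) - 1
df2 <- length(Y) - 1

# Two-sided p-value: twice the smaller tail of the F distribution.
p_lower <- pf(f_stat, df1, df2)
p_value <- 2 * min(p_lower, 1 - p_lower)

c(F = f_stat, p = p_value)
```

    A large p-value here is consistent with the var.equal=TRUE setting used in the two-sample t-test on the same data.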
    
    
    

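    Tests for a binomial proportion

    • A batch of vegetable seeds has an average germination rate of P = 0.85. A random sample of 500 seeds is soaked in a seed-coating agent, and 445 of them germinate. Does the seed-coating agent have an effect?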
    binom.test(445,500,p=0.85)
    
    	Exact binomial test
    
    data:  445 and 500
    number of successes = 445, number of trials = 500, p-value = 0.01207
    alternative hypothesis: true probability of success is not equal to 0.85
    95 percent confidence interval:
     0.8592342 0.9160509
    sample estimates:
    probability of success 
                      0.89 
    
    
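    • From past experience, the rate of chromosomal abnormality in newborns is about 1%. A hospital examined 400 local newborns and found one case. Is the abnormality rate in this area lower than the general level?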
    
    binom.test(1,400,p=0.01,alternative="less")
    
    	Exact binomial test
    
    data:  1 and 400
    number of successes = 1, number of trials = 400, p-value = 0.09048
    alternative hypothesis: true probability of success is less than 0.01
    95 percent confidence interval:
     0.0000000 0.0118043
    sample estimates:
    probability of success 
                    0.0025 
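
    For a one-sided alternative, the exact binomial p-value is just a tail probability of the binomial distribution under H0. A short sketch for the newborn example, with illustrative variable names:

```r
# P(at most 1 abnormal case among 400 | p = 0.01), via the binomial CDF.
p_less <- pbinom(1, size = 400, prob = 0.01)

# The same value, summing the probabilities of 0 and 1 cases explicitly.
p_by_sum <- sum(dbinom(0:1, size = 400, prob = 0.01))

c(p_less, p_by_sum)
```

    Both values should match the p-value reported by binom.test(1, 400, p = 0.01, alternative = "less").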
    
    

    Nonparametric tests

    Pearson's chi-squared goodness-of-fit test for whether data follow a given distribution (chisq.test)

    • The numbers of people who prefer each of five beer brands are:
      A 210
      B 312
      C 170
      D 85
      E 223
      Do the numbers of people preferring the different brands differ?
    X<-c(210, 312, 170, 85, 223)
    chisq.test(X)
    
    	Chi-squared test for given probabilities
    
    data:  X
    X-squared = 136.49, df = 4, p-value < 2.2e-16
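
    With no p argument, chisq.test() compares the counts against equal probabilities. A minimal sketch of the same computation by hand (variable names are illustrative):

```r
X <- c(A = 210, B = 312, C = 170, D = 85, E = 223)

# Under H0 every brand is equally popular: 1000 / 5 = 200 expected per brand.
expected <- rep(sum(X) / length(X), length(X))

# Pearson statistic: sum of (observed - expected)^2 / expected.
chi2 <- sum((X - expected)^2 / expected)

# Upper-tail p-value with (number of cells - 1) degrees of freedom.
p_value <- pchisq(chi2, df = length(X) - 1, lower.tail = FALSE)

c(chi2 = chi2, p = p_value)
```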
    
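    • Test whether the students' scores follow a normal distribution
    X<-c(25, 45, 50, 54, 55, 61, 64, 68, 72, 75, 75,
         78, 79, 81, 83, 84, 84, 84, 85, 86, 86, 86,
         87, 89, 89, 89, 90, 91, 91, 92, 100)
    A<-table(cut(X, br=c(0,69,79,89,100)))
    # cut() splits the range of the variable into intervals
    # table() counts the observations that fall in each interval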
    
    p<-pnorm(c(70,80,90,100), mean(X), sd(X))
    p<-c(p[1], p[2]-p[1], p[3]-p[2], 1-p[3])
    chisq.test(A,p=p)
    
    	Chi-squared test for given probabilities
    
    data:  A
    X-squared = 8.334, df = 3, p-value = 0.03959
    # p-value < 0.05: normality of the scores is rejected at the 0.05 level
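
    The 3 degrees of freedom reported by chisq.test() treat the cell probabilities as known, although the mean and standard deviation were estimated from the same scores. A common textbook adjustment, which is an assumption here and not part of the original example, subtracts one degree of freedom per estimated parameter; with parameters estimated from the raw data it is only approximate. A sketch of that adjustment:

```r
X <- c(25, 45, 50, 54, 55, 61, 64, 68, 72, 75, 75,
       78, 79, 81, 83, 84, 84, 84, 85, 86, 86, 86,
       87, 89, 89, 89, 90, 91, 91, 92, 100)
A <- table(cut(X, breaks = c(0, 69, 79, 89, 100)))

# Cell probabilities under a normal distribution with estimated mean and sd.
p <- pnorm(c(70, 80, 90, 100), mean(X), sd(X))
p <- c(p[1], p[2] - p[1], p[3] - p[2], 1 - p[3])
fit <- chisq.test(A, p = p)

# Adjusted df: cells - 1 - number of estimated parameters (mean and sd).
df_adj <- length(A) - 1 - 2
p_adj  <- pchisq(unname(fit$statistic), df = df_adj, lower.tail = FALSE)

c(p_reported = fit$p.value, p_adjusted = p_adj)
```

    The adjusted p-value is smaller than the reported one, so the conclusion of the test above does not change.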
    

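    In hybrid offspring of barley the theoretical ratio of awn types is awnless : long-awned : short-awned = 9:3:4, and the observed counts are 335:125:160. Are the observations consistent with the theoretical ratio?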

    chisq.test(c(335, 125, 160), p=c(9,3,4)/16)
    
    	Chi-squared test for given probabilities
    
    data:  c(335, 125, 160)
    X-squared = 1.362, df = 2, p-value = 0.5061
    
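    • There are 42 observations, each the number of calls received by a telephone switchboard in a given time interval:
      Number of calls received   0   1   2   3   4   5   6
      Frequency                  7   10  12  8   3   2   0
      Does the number of calls received in a time interval follow a Poisson distribution?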
    x<-0:6
    y<-c(7,10,12,8,3,2,0)
    lambda<-mean(rep(x,y))   # sample mean, used as the Poisson rate estimate
    q<-ppois(x,lambda)
    n<-length(y)
    p<-numeric(n)
    p[1]<-q[1]
    p[n]<-1-q[n-1]
    for(i in 2:(n-1))
      p[i]<-q[i]-q[i-1]      # P(X = i-1) under the fitted Poisson
    chisq.test(y, p=p)
    
    	Chi-squared test for given probabilities
    
    data:  y
    X-squared = 1.5057, df = 6, p-value = 0.9591
    
    Warning message:
    In chisq.test(y, p = p) : Chi-squared approximation may be incorrect
    
    ## The expected counts for 4, 5 and 6 calls are too small, so merge them into one cell
    Z<-c(7, 10, 12, 8, 5)
    n<-length(Z); p<-p[1:(n-1)]; p[n]<-1-q[n-1]
    chisq.test(Z, p=p)
    
    	Chi-squared test for given probabilities
    
    data:  Z
    X-squared = 0.53889, df = 4, p-value = 0.9696
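
    Whether the chi-squared approximation can be trusted depends on the expected cell counts; a common rule of thumb asks for at least 5 per cell. A minimal sketch, assuming the Poisson rate is estimated by the sample mean, that lists the expected counts for the call data:

```r
x <- 0:6
y <- c(7, 10, 12, 8, 3, 2, 0)

# Estimated Poisson rate: total calls divided by number of observations.
lambda <- sum(x * y) / sum(y)

# Cell probabilities: P(X = 0), ..., P(X = 5), and P(X >= 6) for the last cell.
p <- c(dpois(0:5, lambda), ppois(5, lambda, lower.tail = FALSE))

# Expected counts under the fitted Poisson distribution.
expected <- sum(y) * p
round(expected, 2)

# Cells below the rule-of-thumb threshold are candidates for merging.
which(expected < 5)
```

    The cells for 4, 5 and 6 calls fall below the threshold, which is the usual reason for pooling them into a single "4 or more" cell before testing.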
    
    

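    The smaller the P value, the stronger the grounds for rejecting the null hypothesis and the stronger the statistical evidence that the populations differ. Note that failing to reject H0 is not the same as supporting H0; it only means that the sample does not carry enough information to reject H0.
    By convention, P > 0.05 is called "not significant", 0.01 < P ≤ 0.05 "significant", and P ≤ 0.01 "highly significant".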
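
    The convention can be written as a small lookup. The function significance_label() below is a hypothetical helper introduced only for illustration, not a base R function; the three p-values in the usage line are taken from outputs earlier in this article.

```r
# Hypothetical helper (not a base R function): map p-values to the
# conventional labels, using right-closed intervals so that
# p = 0.01 is "highly significant" and p = 0.05 is "significant".
significance_label <- function(p) {
  stopifnot(is.numeric(p), all(p >= 0 & p <= 1))
  as.character(cut(p,
                   breaks = c(-Inf, 0.01, 0.05, Inf),
                   labels = c("highly significant", "significant", "not significant"),
                   right = TRUE))
}

significance_label(c(0.0002176, 0.03959, 0.07939))
# "highly significant" "significant" "not significant"
```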

    Note: this article draws on Zhang Jinlong's blog on ScienceNet.

  • Original article: https://www.cnblogs.com/shangfr/p/5905721.html