  • Writing a minimal working reproducible example (MWRE) in R

    Original link: http://tecdat.cn/?p=6716

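    The key to getting good help with an R problem is to provide a minimal working reproducible example (MWRE). Making an MWRE is easy in R, and it helps ensure that the people helping you can identify the source of the error and, ideally, send back corrected code that fixes it rather than general pointers. To have an MWRE you need the following items:

    • a minimal dataset that produces the error
    • the minimal runnable code needed to reproduce the error, run on the dataset provided
    • the necessary information on the packages used, the R version, and the system
    • a seed value, if random properties are part of the code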
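As a rough sketch, the four items above can fit in one short script. The data frame `toy`, the column names, and the seed value are hypothetical and chosen only for illustration; the "problem" here is a deliberately misspelled column.

```r
# 1. Seed, so the random draws below are repeatable (arbitrary value)
set.seed(2013)

# 2. Minimal dataset (hypothetical names and sizes)
toy <- data.frame(
    group = sample(c("a", "b"), 20, replace = TRUE),
    score = rnorm(20, mean = 400, sd = 60)
)

# 3. Minimal code that triggers the problem: the column `total` does not
#    exist, so mean() receives NULL and warns. tryCatch() stores the warning
#    text instead of printing it, so the script keeps running.
result <- tryCatch(
    mean(toy$total),
    warning = function(w) conditionMessage(w)
)
print(result)

# 4. Package, R version, and system information
print(sessionInfo())
```

Everything a helper needs is in one place, and nothing depends on files that exist only on your machine.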

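    Let's look at the tools available in R that help us create these components quickly and easily.

    Producing a minimal dataset

    There are three distinct options here:

    1. Use a built-in R dataset
    2. Create a new vector / data.frame from scratch
    3. Output the data you are currently working on in a shareable way

    Let's look at each in turn and see the tools R gives us for it.

    Built-in datasets

    R ships with a few canonical built-in datasets that are well suited to help requests.

    • mtcars
    • iris

    To see all the datasets available in R, just type data(). To load any of these datasets, use the following: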

    data(mtcars)
    head(mtcars)  # to look at the data
    
                       mpg cyl disp  hp drat    wt  qsec vs am gear carb
    Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
    Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
    Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
    Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
    Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
    Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
    

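    This option works well when you know that your trouble is with a particular R command. It is not a good option when you cannot understand why a command you are familiar with fails on your own data.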
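For example, a question about a single command can be posed entirely on mtcars, so the helper needs no files from you. The choice of aggregate() below is only an illustration, not something the original post uses.

```r
data(mtcars)

# Mean miles per gallon for each cylinder count
mpg_by_cyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
print(mpg_by_cyl)
```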

     

    library(eeptools)  # stulevel is a simulated student dataset in eeptools
    data(stulevel)
    names(stulevel)
    
     [1] "X"           "school"      "stuid"       "grade"       "schid"      
     [6] "dist"        "white"       "black"       "hisp"        "indian"     
    [11] "asian"       "econ"        "female"      "ell"         "disab"      
    [16] "sch_fay"     "dist_fay"    "luck"        "ability"     "measerr"    
    [21] "teachq"      "year"        "attday"      "schoolscore" "district"   
    [26] "schoolhigh"  "schoolavg"   "schoollow"   "readSS"      "mathSS"     
    [31] "proflvl"     "race"       
    

    Creating your own data

    Let's create a simulated data frame of student test scores and demographics.
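The call that builds Data is not shown at this point in the post. A sketch that mirrors the data.frame() call used later in the post, with a set.seed() added so the draws are repeatable, might look like this. The seed value is an assumption, so the numbers it produces will not match the output printed in this post.

```r
set.seed(1234)  # arbitrary seed, added here for reproducibility

Data <- data.frame(
    id     = seq(1, 1000),
    gender = sample(c("male", "female"), 1000, replace = TRUE),
    mathSS = rnorm(1000, mean = 400, sd = 60),
    readSS = rnorm(1000, mean = 370, sd = 58.3),
    race   = sample(c("H", "B", "W", "I", "A"), 1000, replace = TRUE)
)
```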

     
    
    head(Data)
    
      id gender mathSS readSS race
    1  1 female  396.6  349.2    H
    2  2   male  369.5  330.7    W
    3  3 female  423.3  354.3    B
    4  4   male  348.7  333.1    W
    5  5   male  299.7  353.4    H
    6  6 female  338.0  422.1    I
    

    We now have simulated student data. Let's use a quick plot to look at the relationships between the variables:

    library(ggplot2)  # qplot() and theme_bw() come from ggplot2
    qplot(mathSS, readSS, data=Data, color=race)+theme_bw()

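    It looks like race is fairly evenly distributed and there is no relationship between mathSS and readSS. For some applications this data is sufficient, but for others we may want more realistic data.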
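One possible way to make the simulation more realistic is to let both scores depend on a shared latent trait, so that they become correlated. This is only a sketch: the variable `ability`, the weights 0.8 and 0.6, and the seed are assumptions made for illustration, not part of the original post.

```r
set.seed(42)  # arbitrary seed
n <- 1000

# Hypothetical latent trait shared by both test scores
ability <- rnorm(n)

# 0.8^2 + 0.6^2 = 1, so each score keeps roughly its intended sd
mathSS <- 400 + 60   * (0.8 * ability + 0.6 * rnorm(n))
readSS <- 370 + 58.3 * (0.8 * ability + 0.6 * rnorm(n))

score_cor <- cor(mathSS, readSS)
print(score_cor)
```

With these weights the theoretical correlation is 0.64, so the sample value should land close to that.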

    table(Data$race)
    
    
      A   B   H   I   W 
    192 195 202 203 208 
    
    cor(Data$mathSS, Data$readSS)
    
    [1] -0.01236
    

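    Outputting your current data

    The best practice here is to create a subset of the data you are working on and then output it with the dput command.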

    dput(head(stulevel, 5))
    
    structure(list(X = c(44L, 53L, 116L, 244L, 274L), school = c(1L, 
    1L, 1L, 1L, 1L), stuid = c(149995L, 13495L, 106495L, 45205L, 
    142705L), grade = c(3L, 3L, 3L, 3L, 3L), schid = c(495L, 495L, 
    495L, 205L, 205L), dist = c(105L, 45L, 45L, 15L, 75L), white = c(0L, 
    0L, 0L, 0L, 0L), black = c(1L, 1L, 1L, 1L, 1L), hisp = c(0L, 
    0L, 0L, 0L, 0L), indian = c(0L, 0L, 0L, 0L, 0L), asian = c(0L, 
    0L, 0L, 0L, 0L), econ = c(0L, 1L, 1L, 1L, 1L), female = c(0L, 
    0L, 0L, 0L, 0L), ell = c(0L, 0L, 0L, 0L, 0L), disab = c(0L, 0L, 
    0L, 0L, 0L), sch_fay = c(0L, 0L, 0L, 0L, 0L), dist_fay = c(0L, 
    0L, 0L, 0L, 0L), luck = c(0L, 1L, 0L, 1L, 0L), ability = c(87.8540493076978, 
    97.7875614875502, 104.493033823157, 111.671512686787, 81.9253913501755
    ), measerr = c(11.1332639734731, 6.8223938284885, -7.85615858883968, 
    -17.5741522573204, 52.9833376218976), teachq = c(39.0902471213577, 
    0.0984819168655733, 39.5388526976972, 24.1161227728637, 56.6806130368238
    ), year = c(2000L, 2000L, 2000L, 2000L, 2000L), attday = c(180L, 
    180L, 160L, 168L, 156L), schoolscore = c(29.2242722609726, 55.9632592971956, 
    55.9632592971956, 55.9632592971956, 55.9632592971956), district = c(3L, 
    3L, 3L, 3L, 3L), schoolhigh = c(0L, 0L, 0L, 0L, 0L), schoolavg = c(1L, 
    1L, 1L, 1L, 1L), schoollow = c(0L, 0L, 0L, 0L, 0L), readSS = c(357.286464546893, 
    263.904581222636, 369.672179143784, 346.595665384202, 373.125445669888
    ), mathSS = c(387.280282915207, 302.572371332695, 365.461432571883, 
    344.496386434725, 441.15810279391), proflvl = structure(c(2L, 
    3L, 2L, 2L, 2L), .Label = c("advanced", "basic", "below basic", 
    "proficient"), class = "factor"), race = structure(c(2L, 2L, 
    2L, 2L, 2L), .Label = c("A", "B", "H", "I", "W"), class = "factor")), .Names = c("X", 
    "school", "stuid", "grade", "schid", "dist", "white", "black", 
    "hisp", "indian", "asian", "econ", "female", "ell", "disab", 
    "sch_fay", "dist_fay", "luck", "ability", "measerr", "teachq", 
    "year", "attday", "schoolscore", "district", "schoolhigh", "schoolavg", 
    "schoollow", "readSS", "mathSS", "proflvl", "race"), row.names = c(NA, 
    5L), class = "data.frame")
    

    The resulting code can be copied and pasted into an R session, where it rebuilds the dataset exactly as described.
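To see that the round trip works, the dput() text can be captured and evaluated again. The sketch below uses a small slice of mtcars rather than stulevel so that it runs without extra packages; the object names `small` and `rebuilt` are hypothetical.

```r
# Take a small subset, as you would with your own data
small <- head(mtcars[, c("mpg", "cyl")], 3)

# Capture what dput() would print, as if it had been pasted into an email
txt <- capture.output(dput(small))

# The helper pastes the text into R; here we parse and evaluate it instead
rebuilt <- eval(parse(text = paste(txt, collapse = "\n")))

# Compare the rebuilt object with the original
print(all.equal(small, rebuilt))
```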

     

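    Anonymizing your data

    It may also be the case that you want to dput your data but keep its contents anonymous. A Google search turned up a decent function to accomplish this: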

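    anonym <- function(df) {
        # Rename columns to letters: A..Z, then AA, BB, ... for wider data
        labels <- LETTERS
        if (length(df) > 26) {
            labels <- c(LETTERS, paste(LETTERS, LETTERS, sep = ""))
        }
        names(df) <- labels[seq_along(df)]

        level.id.df <- function(df) {
            level.id <- function(i) {
                if (is.factor(df[, i]) | is.character(df[, i])) {
                    # Replace each level with "<column letter>.<level number>"
                    column <- paste(names(df)[i], as.numeric(as.factor(df[, i])),
                        sep = ".")
                } else if (is.numeric(df[, i])) {
                    # Rescale by the column mean, consistent with the output below
                    column <- df[, i] / mean(df[, i], na.rm = TRUE)
                } else {
                    column <- df[, i]
                }
                return(column)
            }
            DF <- data.frame(sapply(seq_along(df), level.id))
            names(DF) <- names(df)
            return(DF)
        }
        df <- level.id.df(df)
        return(df)
    }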
    
    test <- anonym(stulevel)
    head(test[, c(2:6, 28:32)])
    
                        B                 C                 D
    1 0.00217632592657076  1.51160611230132 0.551020408163265
    2 0.00217632592657076 0.135998696526593 0.551020408163265
    3 0.00217632592657076  1.07322572705443 0.551020408163265
    4 0.00217632592657076 0.455562880806568 0.551020408163265
    5 0.00217632592657076  1.43813960635994 0.551020408163265
    6 0.00217632592657076 0.151115261535106 0.551020408163265
     

     

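    Creating the example

    Once we have a minimal dataset, we need to reproduce our error on that dataset.

    Let's look at an example of an error that occurs when aggregating data.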

    Data <- data.frame(id = seq(1, 1000), gender = sample(c("male", "female"), 1000, 
        replace = TRUE), mathSS = rnorm(1000, mean = 400, sd = 60), readSS = rnorm(1000, 
        mean = 370, sd = 58.3), race = sample(c("H", "B", "W", "I", "A"), 1000, 
        replace = TRUE))
    
    myAgg <- Data[, list(meanM = mean(mathSS)), by = race]
    
    Error: unused argument(s) (by = race)
    
    head(myAgg)
    
    Error: object 'myAgg' not found
    

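    Why am I getting this error? If you send the code above to someone, they can evaluate it quickly and, if they know you are trying to use the data.table package, spot the mistake: Data is still a plain data.frame.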
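A quick way to confirm this kind of diagnosis is to check the class of the object before using package-specific syntax. The sketch below uses only base R and a tiny hypothetical data frame; the base aggregate() call is shown as one alternative, not as the post's own fix.

```r
# Tiny stand-in for the simulated data (hypothetical values)
Data <- data.frame(
    race   = c("H", "B", "W", "H", "B", "W"),
    mathSS = c(400, 410, 390, 420, 380, 405)
)

print(class(Data))                           # a plain data.frame
is_dt <- inherits(Data, "data.table")        # FALSE: by = is not supported here
print(is_dt)

# Base R equivalent of the intended aggregation
myAgg <- aggregate(mathSS ~ race, data = Data, FUN = mean)
print(myAgg)
```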

    library(data.table)
    Data <- data.frame(id = seq(1, 1000), gender = sample(c("male", "female"), 1000, 
        replace = TRUE), mathSS = rnorm(1000, mean = 400, sd = 60), readSS = rnorm(1000, 
        mean = 370, sd = 58.3), race = sample(c("H", "B", "W", "I", "A"), 1000, 
        replace = TRUE))
    
    Data <- data.table(Data)
    myAgg <- Data[, list(meanM = mean(mathSS)), by = race]
    head(myAgg)
    
       race meanM
    1:    H 398.6
    2:    B 405.1
    3:    A 397.8
    4:    W 395.7
    5:    I 399.1
    

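    Session info

    However, they might not know this, so we need to provide one final piece of information. To diagnose the error, a helper must know the system you are running on, which packages are loaded in your workspace, and which versions of R and of each package you are using.

    Just add the output of the sessionInfo() function. It is easy to copy and paste, or to include in a knitr document.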

    sessionInfo()
    
    R version 2.15.2 (2012-10-26)
    Platform: x86_64-w64-mingw32/x64 (64-bit)
    
    locale:
    [1] LC_COLLATE=English_United States.1252 
    [2] LC_CTYPE=English_United States.1252   
    [3] LC_MONETARY=English_United States.1252
    [4] LC_NUMERIC=C                          
    [5] LC_TIME=English_United States.1252    
    
    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base     
    
    other attached packages:
    [1] data.table_1.8.8 eeptools_0.2     ggplot2_0.9.3.1  knitr_1.2       
    
    loaded via a namespace (and not attached):
     [1] colorspace_1.2-2   dichromat_2.0-0    digest_0.6.3      
     [4] evaluate_0.4.3     formatR_0.7        grid_2.15.2       
     [7] gtable_0.1.2       labeling_0.1       MASS_7.3-23       
    [10] munsell_0.4        plyr_1.8           proto_0.3-10      
    [13] RColorBrewer_1.0-5 reshape2_1.2.2     scales_0.2.3      
    [16] stringr_0.6.2      tools_2.15.2  
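If you want the session details as text, for example to paste into an email or a forum post, the output can be captured into a character vector. This is a minimal base R sketch; the variable names are hypothetical.

```r
# Capture the printed sessionInfo() as lines of text
info_lines <- capture.output(sessionInfo())
writeLines(info_lines)

# Individual pieces can also be queried directly
r_version   <- R.version.string
stats_ver   <- as.character(packageVersion("stats"))
print(r_version)
print(stats_ver)
```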

    If you have any questions, please leave a comment below.

  • Source: https://www.cnblogs.com/tecdat/p/10684684.html