  • Weka Data Mining: Attribute Selection


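    1 Attribute Selection

    Attribute selection searches the space of possible attribute combinations in the data to find the subset that gives the best predictive performance. Selecting attributes by hand is both tedious and error-prone, so Weka provides the Select attributes panel to automate the task. Automatic selection requires two objects to be configured: an attribute evaluator and a search method, as shown in the figure below:
    [Figure: the Select attributes panel]
    The attribute evaluator determines how a merit value is assigned to each attribute (or attribute subset), and the search method determines what style of search is performed.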

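    2 Overview of Attribute Selection Algorithms

    2-1 Attribute Subset Evaluators

    An attribute subset evaluator takes a subset of attributes and returns a numeric measure that guides the search.
    The CfsSubsetEval evaluator assesses the predictive ability of each attribute individually together with the degree of redundancy among them, and prefers sets of attributes that are highly correlated with the class attribute but have low correlation with one another. An option iteratively adds the attributes most correlated with the class, provided that the subset does not already contain an attribute whose correlation with the attribute in question is even higher. The evaluator can treat a missing value as a separate value, or distribute the counts for missing values among the other values in proportion to their frequency of occurrence.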
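
    The trade-off between class correlation and redundancy can be made concrete with the merit heuristic from Hall's correlation-based feature selection, on which CfsSubsetEval is based as far as I understand it. The sketch below is an illustration under that assumption, not Weka's source code; the class name CfsMerit and the way the two average correlations are passed in are hypothetical.

```java
// Sketch of the CFS merit heuristic; names and inputs are illustrative assumptions.
class CfsMerit {
    /**
     * @param k            number of attributes in the subset
     * @param avgClassCorr mean attribute-class correlation over the subset
     * @param avgInterCorr mean attribute-attribute correlation within the subset
     * @return merit = k * avgClassCorr / sqrt(k + k * (k - 1) * avgInterCorr)
     */
    static double merit(int k, double avgClassCorr, double avgInterCorr) {
        if (k <= 0) {
            return 0.0; // the empty subset has no merit
        }
        return k * avgClassCorr / Math.sqrt(k + k * (k - 1) * avgInterCorr);
    }

    public static void main(String[] args) {
        // Four attributes, each correlated 0.5 with the class.
        System.out.println(merit(4, 0.5, 0.0)); // attributes uncorrelated with one another
        System.out.println(merit(4, 0.5, 1.0)); // attributes fully redundant
    }
}
```

    With the class correlation held fixed, the merit falls as the attributes become more correlated with one another, which is why redundant attributes are penalized.
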
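    The WrapperSubsetEval evaluator is a wrapper method. It uses a classifier to evaluate attribute sets, employing cross-validation to estimate the accuracy of the learning scheme for each subset.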
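
    The wrapper idea can be sketched in a few lines: score a subset by the cross-validated accuracy of a classifier that sees only those attributes. To stay self-contained, the sketch below swaps in a 1-nearest-neighbour classifier and deterministic folds (instance i goes to fold i % folds); both are simplifying assumptions, and WrapperSketch is a hypothetical name, not a Weka class.

```java
// Sketch of wrapper-style subset evaluation with a stand-in 1-NN classifier.
class WrapperSketch {
    /** Squared Euclidean distance restricted to the attribute indices in subset. */
    static double distance(double[] a, double[] b, int[] subset) {
        double sum = 0.0;
        for (int j : subset) {
            double d = a[j] - b[j];
            sum += d * d;
        }
        return sum;
    }

    /** k-fold cross-validated accuracy of 1-NN using only the attributes in subset. */
    static double cvAccuracy(double[][] x, int[] y, int[] subset, int folds) {
        int correct = 0;
        for (int i = 0; i < x.length; i++) {
            int testFold = i % folds;       // fold in which instance i is held out
            int predicted = -1;
            double best = Double.POSITIVE_INFINITY;
            for (int t = 0; t < x.length; t++) {
                if (t % folds == testFold) {
                    continue;               // training data excludes the held-out fold
                }
                double d = distance(x[i], x[t], subset);
                if (d < best) {
                    best = d;
                    predicted = y[t];
                }
            }
            if (predicted == y[i]) {
                correct++;
            }
        }
        return (double) correct / x.length;
    }

    public static void main(String[] args) {
        // Toy data: attribute 0 separates the classes, attribute 1 does not.
        double[][] x = {{0.0, 0.0}, {1.0, 0.1}, {0.1, 1.0}, {0.9, 0.9}, {0.2, 0.5}, {0.8, 0.4}};
        int[] y = {0, 1, 0, 1, 0, 1};
        System.out.println(cvAccuracy(x, y, new int[] {0}, 3));
        System.out.println(cvAccuracy(x, y, new int[] {1}, 3));
    }
}
```

    A search method would call cvAccuracy once per candidate subset, which is why wrapper evaluation is much more expensive than a filter such as CfsSubsetEval.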

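    2-2 Single-Attribute Evaluators

    Single-attribute evaluators are used together with the Ranker search method. Ranker produces a ranked list of attributes and can discard the lowest-ranked ones so that only a given number remain.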
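
    What Ranker does with the per-attribute scores can be sketched as a sort followed by a cut-off. This is a minimal illustration, not Weka's implementation; RankerSketch and its numToSelect parameter are hypothetical names chosen for the example.

```java
import java.util.Arrays;

// Sketch of ranking attributes by score and keeping the best few.
class RankerSketch {
    /** Returns the indices of the numToSelect highest-scoring attributes, best first. */
    static int[] rank(double[] scores, int numToSelect) {
        Integer[] order = new Integer[scores.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Sort attribute indices by descending score.
        Arrays.sort(order, (a, b) -> Double.compare(scores[b], scores[a]));
        int keep = Math.min(numToSelect, order.length);
        int[] result = new int[keep];
        for (int i = 0; i < keep; i++) {
            result[i] = order[i];
        }
        return result;
    }

    public static void main(String[] args) {
        double[] scores = {0.1, 0.7, 0.3, 0.5}; // one illustrative score per attribute
        System.out.println(Arrays.toString(rank(scores, 2)));
    }
}
```
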
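    ReliefFAttributeEval is an instance-based evaluator: it samples instances at random and examines neighbouring instances of the same and of different classes. It works on both discrete and continuous data. Its parameters specify the number of instances to sample, the number of neighbours to examine, whether to weight neighbours by distance, and an exponential function that controls how quickly the weights decay with distance.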
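
    The core of the Relief idea can be sketched as a weight update: an attribute gains weight when it differs on the nearest instance of another class (the nearest miss) and loses weight when it differs on the nearest instance of the same class (the nearest hit). The sketch below is the basic Relief scheme under simplifying assumptions, not ReliefFAttributeEval itself: it visits every instance instead of sampling, uses a single neighbour, applies no distance weighting, and assumes numeric attributes already scaled to [0, 1]. ReliefSketch is a hypothetical name.

```java
// Sketch of basic Relief attribute weighting (one nearest hit, one nearest miss).
class ReliefSketch {
    /** Manhattan distance; attribute values are assumed to lie in [0, 1]. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int j = 0; j < a.length; j++) {
            sum += Math.abs(a[j] - b[j]);
        }
        return sum;
    }

    /** Index of the instance nearest to x[i] with the same class (hit) or a different class (miss). */
    static int nearest(double[][] x, int[] y, int i, boolean sameClass) {
        int best = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int t = 0; t < x.length; t++) {
            if (t == i || (y[t] == y[i]) != sameClass) {
                continue;
            }
            double d = distance(x[i], x[t]);
            if (d < bestDist) {
                bestDist = d;
                best = t;
            }
        }
        return best;
    }

    /** One weight per attribute; larger means more useful for separating the classes. */
    static double[] weights(double[][] x, int[] y) {
        int m = x.length;
        double[] w = new double[x[0].length];
        for (int i = 0; i < m; i++) {
            int hit = nearest(x, y, i, true);
            int miss = nearest(x, y, i, false);
            if (hit < 0 || miss < 0) {
                continue; // no neighbour of the required kind
            }
            for (int a = 0; a < w.length; a++) {
                w[a] -= Math.abs(x[i][a] - x[hit][a]) / m;  // differing from a hit is bad
                w[a] += Math.abs(x[i][a] - x[miss][a]) / m; // differing from a miss is good
            }
        }
        return w;
    }
}
```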

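    The InfoGainAttributeEval evaluator evaluates an attribute by measuring its information gain with respect to the class. It first discretizes numeric attributes using an MDL-based (minimum description length) discretization method (it can alternatively be set to binarize them).
    The GainRatioAttributeEval evaluator evaluates an attribute by measuring its gain ratio with respect to the class.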
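
    Both measures can be computed from a table of counts for a nominal (or already discretized) attribute: information gain is the class entropy minus the class entropy conditional on the attribute, and gain ratio divides that by the entropy of the attribute's own value distribution. The sketch below is an illustration only; InfoGainSketch is a hypothetical name and the counts in main are made up for the example, not taken from labor.arff.

```java
// Sketch of information gain and gain ratio from a value-by-class count table.
class InfoGainSketch {
    static double log2(double v) {
        return Math.log(v) / Math.log(2.0);
    }

    /** Entropy in bits of a distribution given as raw counts. */
    static double entropy(double[] counts) {
        double total = 0.0;
        for (double c : counts) {
            total += c;
        }
        double h = 0.0;
        for (double c : counts) {
            if (c > 0) {
                double p = c / total;
                h -= p * log2(p);
            }
        }
        return h;
    }

    /** Row totals of the table: how many instances take each attribute value. */
    static double[] valueTotals(double[][] counts) {
        double[] totals = new double[counts.length];
        for (int v = 0; v < counts.length; v++) {
            for (double c : counts[v]) {
                totals[v] += c;
            }
        }
        return totals;
    }

    /** counts[v][c] = number of instances with attribute value v and class c. */
    static double infoGain(double[][] counts) {
        double[] classTotals = new double[counts[0].length];
        double[] perValue = valueTotals(counts);
        double n = 0.0;
        for (int v = 0; v < counts.length; v++) {
            for (int c = 0; c < counts[v].length; c++) {
                classTotals[c] += counts[v][c];
                n += counts[v][c];
            }
        }
        double conditional = 0.0;
        for (int v = 0; v < counts.length; v++) {
            conditional += perValue[v] / n * entropy(counts[v]);
        }
        return entropy(classTotals) - conditional;
    }

    /** Information gain divided by the entropy of the attribute's value distribution. */
    static double gainRatio(double[][] counts) {
        double splitInfo = entropy(valueTotals(counts));
        return splitInfo == 0.0 ? 0.0 : infoGain(counts) / splitInfo;
    }

    public static void main(String[] args) {
        // Three attribute values, two classes (illustrative counts).
        double[][] counts = {{2, 3}, {4, 0}, {3, 2}};
        System.out.println(infoGain(counts));
        System.out.println(gainRatio(counts));
    }
}
```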

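    The remaining evaluators can be studied as the need arises.

    2-3 Search Methods

    A search method traverses the attribute space to find a good subset, whose quality is measured by the chosen attribute subset evaluator.
    The BestFirst search method performs greedy hill climbing with backtracking. The user can specify how many consecutive non-improving nodes must be encountered before the system backtracks. It can search forward from the empty set of attributes, backward from the full set, or start at an intermediate point and search in both directions (by considering all possible single-attribute additions and deletions). Subsets that have already been evaluated can be cached to improve efficiency.
    The GreedyStepwise search method searches the space of attribute subsets greedily and never backtracks.
    Ranker is not a subset search method but a scheme for ranking individual attributes.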
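
    A greedy forward search without backtracking, in the spirit of GreedyStepwise started from the empty set, can be sketched as follows. The subset evaluator is passed in as a plain function from a subset to a merit value; GreedyForwardSketch and that function-based interface are assumptions made for the illustration, not Weka's API.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.function.ToDoubleFunction;

// Sketch of greedy forward selection: add the best attribute until nothing improves.
class GreedyForwardSketch {
    static Set<Integer> search(int numAttributes, ToDoubleFunction<Set<Integer>> merit) {
        Set<Integer> current = new TreeSet<>();          // start with no attributes
        double currentMerit = merit.applyAsDouble(current);
        while (true) {
            int bestAttr = -1;
            double bestMerit = currentMerit;
            for (int a = 0; a < numAttributes; a++) {
                if (current.contains(a)) {
                    continue;
                }
                Set<Integer> candidate = new TreeSet<>(current);
                candidate.add(a);
                double m = merit.applyAsDouble(candidate);
                if (m > bestMerit) {                     // keep the best single addition
                    bestMerit = m;
                    bestAttr = a;
                }
            }
            if (bestAttr < 0) {
                return current;                          // no addition improves the merit
            }
            current.add(bestAttr);
            currentMerit = bestMerit;
        }
    }

    public static void main(String[] args) {
        // Toy merit: attributes 0 and 2 are useful, every attribute costs 0.1.
        Set<Integer> chosen = search(4, s ->
                (s.contains(0) ? 0.5 : 0.0) + (s.contains(2) ? 0.3 : 0.0) - 0.1 * s.size());
        System.out.println(chosen);
    }
}
```

    Because the search never revisits an earlier decision, it can stop at a local optimum; BestFirst addresses this by keeping earlier candidates and backtracking to them.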

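    3 A Worked Example of Attribute Selection in Weka

    The usual purpose of attribute selection is to improve classification, because the attributes differ in how strongly they are associated with the target attribute to be predicted.

    Using the labor dataset, labor.arff
    CfsSubsetEval
    Configuration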

    === Run information ===
    
    Evaluator:    weka.attributeSelection.CfsSubsetEval -P 1 -E 1
    Search:       weka.attributeSelection.GreedyStepwise -T -1.7976931348623157E308 -N -1 -num-slots 1
    Relation:     labor-neg-data
    Instances:    57
    Attributes:   17
                  duration
                  wage-increase-first-year
                  wage-increase-second-year
                  wage-increase-third-year
                  cost-of-living-adjustment
                  working-hours
                  pension
                  standby-pay
                  shift-differential
                  education-allowance
                  statutory-holidays
                  vacation
                  longterm-disability-assistance
                  contribution-to-dental-plan
                  bereavement-assistance
                  contribution-to-health-plan
                  class
    Evaluation mode:    evaluate on all training data
    
    
    
    === Attribute Selection on all input data ===
    
    Search Method:
        Greedy Stepwise (forwards).
        Start set: no attributes
        Merit of best subset found:    0.363
    
    Attribute Subset Evaluator (supervised, Class (nominal): 17 class):
        CFS Subset Evaluator
        Including locally predictive attributes
    
    Selected attributes: 2,3,5,11,12,13,14 : 7
                         wage-increase-first-year
                         wage-increase-second-year
                         cost-of-living-adjustment
                         statutory-holidays
                         vacation
                         longterm-disability-assistance
                         contribution-to-dental-plan
    
    

    WrapperSubsetEval evaluator

    Configuration

    === Run information ===
    
    Evaluator:    weka.attributeSelection.WrapperSubsetEval -B weka.classifiers.trees.J48 -F 5 -T 0.01 -R 1 -E DEFAULT -- -C 0.25 -M 2
    Search:       weka.attributeSelection.BestFirst -D 1 -N 5
    Relation:     labor-neg-data
    Instances:    57
    Attributes:   17
                  duration
                  wage-increase-first-year
                  wage-increase-second-year
                  wage-increase-third-year
                  cost-of-living-adjustment
                  working-hours
                  pension
                  standby-pay
                  shift-differential
                  education-allowance
                  statutory-holidays
                  vacation
                  longterm-disability-assistance
                  contribution-to-dental-plan
                  bereavement-assistance
                  contribution-to-health-plan
                  class
    Evaluation mode:    evaluate on all training data
    
    
    
    === Attribute Selection on all input data ===
    
    Search Method:
        Best first.
        Start set: no attributes
        Search direction: forward
        Stale search after 5 node expansions
        Total number of subsets evaluated: 138
        Merit of best subset found:    0.842
    
    Attribute Subset Evaluator (supervised, Class (nominal): 17 class):
        Wrapper Subset Evaluator
        Learning scheme: weka.classifiers.trees.J48
        Scheme options: -C 0.25 -M 2 
        Subset evaluation: classification accuracy
        Number of folds for accuracy estimation: 5
    
    Selected attributes: 1,2,4,6,11,12 : 6
                         duration
                         wage-increase-first-year
                         wage-increase-third-year
                         working-hours
                         statutory-holidays
                         vacation
    
    
    

    Comparison: J48 with 10-fold cross-validation is used to compare the CfsSubsetEval evaluator and the WrapperSubsetEval evaluator.
    Using the full attribute set directly

    === Run information ===
    
    Scheme:       weka.classifiers.trees.J48 -C 0.25 -M 2
    Relation:     labor-neg-data
    Instances:    57
    Attributes:   17
                  duration
                  wage-increase-first-year
                  wage-increase-second-year
                  wage-increase-third-year
                  cost-of-living-adjustment
                  working-hours
                  pension
                  standby-pay
                  shift-differential
                  education-allowance
                  statutory-holidays
                  vacation
                  longterm-disability-assistance
                  contribution-to-dental-plan
                  bereavement-assistance
                  contribution-to-health-plan
                  class
    Test mode:    10-fold cross-validation
    
    === Classifier model (full training set) ===
    
    J48 pruned tree
    ------------------
    
    wage-increase-first-year <= 2.5: bad (15.27/2.27)
    wage-increase-first-year > 2.5
    |   statutory-holidays <= 10: bad (10.77/4.77)
    |   statutory-holidays > 10: good (30.96/1.0)
    
    Number of Leaves  :     3
    
    Size of the tree :  5
    
    
    Time taken to build model: 0.04 seconds
    
    === Stratified cross-validation ===
    === Summary ===
    
    Correctly Classified Instances          42               73.6842 %
    Incorrectly Classified Instances        15               26.3158 %
    Kappa statistic                          0.4415
    Mean absolute error                      0.3192
    Root mean squared error                  0.4669
    Relative absolute error                 69.7715 %
    Root relative squared error             97.7888 %
    Coverage of cases (0.95 level)          91.2281 %
    Mean rel. region size (0.95 level)      85.9649 %
    Total Number of Instances               57     
    
    === Detailed Accuracy By Class ===
    
                     TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                     0.700    0.243    0.609      0.700    0.651      0.444    0.695     0.559     bad
                     0.757    0.300    0.824      0.757    0.789      0.444    0.695     0.738     good
    Weighted Avg.    0.737    0.280    0.748      0.737    0.740      0.444    0.695     0.675     
    
    === Confusion Matrix ===
    
      a  b   <-- classified as
     14  6 |  a = bad
      9 28 |  b = good
    
    

    Using the CfsSubsetEval result, with the attributes filtered first

    === Run information ===
    
    Scheme:       weka.classifiers.trees.J48 -C 0.25 -M 2
    Relation:     labor-neg-data-weka.filters.unsupervised.attribute.Remove-R1,4,6-10,15-16
    Instances:    57
    Attributes:   8
                  wage-increase-first-year
                  wage-increase-second-year
                  cost-of-living-adjustment
                  statutory-holidays
                  vacation
                  longterm-disability-assistance
                  contribution-to-dental-plan
                  class
    Test mode:    10-fold cross-validation
    
    === Classifier model (full training set) ===
    
    J48 pruned tree
    ------------------
    
    wage-increase-first-year <= 2.5: bad (15.27/2.27)
    wage-increase-first-year > 2.5
    |   longterm-disability-assistance = yes
    |   |   statutory-holidays <= 10
    |   |   |   wage-increase-first-year <= 3: bad (2.0)
    |   |   |   wage-increase-first-year > 3: good (3.99)
    |   |   statutory-holidays > 10: good (25.67)
    |   longterm-disability-assistance = no
    |   |   vacation = below_average: bad (5.09/1.09)
    |   |   vacation = average: good (2.64/1.0)
    |   |   vacation = generous: good (2.34)
    
    Number of Leaves  :     7
    
    Size of the tree :  12
    
    
    Time taken to build model: 0 seconds
    
    === Stratified cross-validation ===
    === Summary ===
    
    Correctly Classified Instances          44               77.193  %
    Incorrectly Classified Instances        13               22.807  %
    Kappa statistic                          0.4935
    Mean absolute error                      0.2787
    Root mean squared error                  0.441 
    Relative absolute error                 60.9191 %
    Root relative squared error             92.3655 %
    Coverage of cases (0.95 level)          89.4737 %
    Mean rel. region size (0.95 level)      78.0702 %
    Total Number of Instances               57     
    
    === Detailed Accuracy By Class ===
    
                     TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                     0.650    0.162    0.684      0.650    0.667      0.494    0.737     0.586     bad
                     0.838    0.350    0.816      0.838    0.827      0.494    0.733     0.777     good
    Weighted Avg.    0.772    0.284    0.770      0.772    0.771      0.494    0.735     0.710     
    
    === Confusion Matrix ===
    
      a  b   <-- classified as
     13  7 |  a = bad
      6 31 |  b = good
    
    

    Using the WrapperSubsetEval result, with the attributes filtered first

    === Run information ===
    
    Scheme:       weka.classifiers.trees.J48 -C 0.25 -M 2
    Relation:     labor-neg-data-weka.filters.unsupervised.attribute.Remove-R3,5,7-10,13-16
    Instances:    57
    Attributes:   7
                  duration
                  wage-increase-first-year
                  wage-increase-third-year
                  working-hours
                  statutory-holidays
                  vacation
                  class
    Test mode:    10-fold cross-validation
    
    === Classifier model (full training set) ===
    
    J48 pruned tree
    ------------------
    
    wage-increase-first-year <= 2.5: bad (15.27/2.27)
    wage-increase-first-year > 2.5
    |   statutory-holidays <= 10
    |   |   vacation = below_average: bad (7.54/1.54)
    |   |   vacation = average: bad (0.0)
    |   |   vacation = generous: good (3.23)
    |   statutory-holidays > 10: good (30.96/1.0)
    
    Number of Leaves  :     5
    
    Size of the tree :  8
    
    
    Time taken to build model: 0 seconds
    
    === Stratified cross-validation ===
    === Summary ===
    
    Correctly Classified Instances          46               80.7018 %
    Incorrectly Classified Instances        11               19.2982 %
    Kappa statistic                          0.5905
    Mean absolute error                      0.2593
    Root mean squared error                  0.4162
    Relative absolute error                 56.6868 %
    Root relative squared error             87.1592 %
    Coverage of cases (0.95 level)          92.9825 %
    Mean rel. region size (0.95 level)      78.9474 %
    Total Number of Instances               57     
    
    === Detailed Accuracy By Class ===
    
                     TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                     0.800    0.189    0.696      0.800    0.744      0.594    0.775     0.608     bad
                     0.811    0.200    0.882      0.811    0.845      0.594    0.775     0.808     good
    Weighted Avg.    0.807    0.196    0.817      0.807    0.810      0.594    0.775     0.738     
    
    === Confusion Matrix ===
    
      a  b   <-- classified as
     16  4 |  a = bad
      7 30 |  b = good
    
    

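    Summary:
    First, attribute selection improves classification accuracy: J48 reaches 73.68% on the full attribute set, 77.19% on the CfsSubsetEval subset, and 80.70% on the WrapperSubsetEval subset.
    Second, in this example WrapperSubsetEval outperforms CfsSubsetEval.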
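
    The accuracy and kappa figures in the three runs can be rechecked from the confusion matrices printed above. The sketch below uses the standard definitions of accuracy and Cohen's kappa; ConfusionStats is a hypothetical helper, not part of Weka.

```java
// Sketch: accuracy and Cohen's kappa from a confusion matrix (rows = actual, columns = predicted).
class ConfusionStats {
    static double accuracy(int[][] m) {
        int correct = 0;
        int total = 0;
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < m[i].length; j++) {
                total += m[i][j];
                if (i == j) {
                    correct += m[i][j];
                }
            }
        }
        return (double) correct / total;
    }

    static double kappa(int[][] m) {
        int n = m.length;
        double total = 0.0;
        double[] row = new double[n];
        double[] col = new double[n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                row[i] += m[i][j];
                col[j] += m[i][j];
                total += m[i][j];
            }
        }
        double expected = 0.0; // agreement expected by chance
        for (int i = 0; i < n; i++) {
            expected += row[i] * col[i] / (total * total);
        }
        return (accuracy(m) - expected) / (1.0 - expected);
    }

    public static void main(String[] args) {
        int[][] full = {{14, 6}, {9, 28}};    // J48 on all attributes
        int[][] cfs = {{13, 7}, {6, 31}};     // J48 on the CfsSubsetEval subset
        int[][] wrapper = {{16, 4}, {7, 30}}; // J48 on the WrapperSubsetEval subset
        for (int[][] m : new int[][][] {full, cfs, wrapper}) {
            System.out.printf("accuracy=%.4f kappa=%.4f%n", accuracy(m), kappa(m));
        }
    }
}
```

    The values agree with the Summary sections above: 42, 44 and 46 correct out of 57, with kappa 0.4415, 0.4935 and 0.5905.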

  • Original article: https://www.cnblogs.com/mrzhang123/p/5365811.html