  • Weka: calling the EM algorithm to perform clustering

    EM algorithm:

    Write code in Eclipse that reads the data file, then calls the EM algorithm and prints the result:

    package EMAlg;
    import java.io.*;
    
    import weka.core.*;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Remove;
    import weka.clusterers.*;
    public class EMAlg {
    
        public EMAlg() {
            // TODO Auto-generated constructor stub
            System.out.println("this is the EMAlg");
        }
    
        public static void main(String[] args) throws Exception {
            // Path to the ARFF file; forward slashes avoid having to escape backslashes in a Java string.
            String file="C:/Program Files/DataMining/Weka-3-6-10/data/labor.arff";
            FileReader FReader=new FileReader(file);
            BufferedReader Reader= new BufferedReader(FReader);
            
            Instances data=new Instances(Reader);
            data.setClassIndex(data.numAttributes()-1);// use the last attribute as the class attribute
            
            Remove filter=new Remove();
            System.out.println("The value of data.classIndex() is: "+data.classIndex());
            System.out.println("Number of attributes read: "+data.numAttributes()+".");
            filter.setAttributeIndices(""+(data.classIndex()+1));
            /*filter.setAttributeIndices();
             * Set which attributes are to be deleted (or kept if invert is true)
             * In other words, this method selects which attributes are removed.
             * Parameters:
             *     rangeList - a string representing the list of attributes. Since the string will typically come from a user, attributes are indexed from 1.
             *     eg: first-3,5,6-last
             */
            filter.setInputFormat(data);
            /*
             * public boolean setInputFormat(Instances instanceInfo)throws java.lang.Exception
             * Sets the format of the input instances. If the filter is able to determine the output format before seeing any input instances, it does so here.
             * This default implementation clears the output format and output queue, and the new batch flag is set. 
             * Overriders should call super.setInputFormat(Instances)
             * Parameters:
             * instanceInfo - an Instances object containing the input instance structure (any instances contained in the object are ignored - only the structure is required).
             * Returns:
             *        true if the outputFormat may be collected immediately
             * Throws:
             * java.lang.Exception - if the inputFormat can't be set successfully
            */
            Instances dataCluster=Filter.useFilter(data, filter);
            /*public static Instances useFilter(Instances data,Filter filter)throws java.lang.Exception
             * Filters an entire set of instances through a filter and returns the new set.
             * It takes two arguments, the data to be filtered and the filter to use, and returns the new data set.
             * Parameters:
             * data - the data to be filtered
             * filter - the filter to be used
             * Returns:
             *     the filtered set of data
             * Throws:
             *     java.lang.Exception - if the filter can't be used successfully
             */
            EM clusterer=new EM();
            /*
             * public class EM
             * extends RandomizableDensityBasedClusterer
             * implements NumberOfClustersRequestable, WeightedInstancesHandler
             * Simple EM (expectation maximisation) class.
             * EM assigns a probability distribution to each instance which indicates the probability of it belonging to each of the clusters. EM can decide how many clusters to create by cross validation, or you may specify apriori how many clusters to generate.
             * The cross validation performed to determine the number of clusters is done in the following steps:
             * 1. the number of clusters is set to 1
             * 2. the training set is split randomly into 10 folds.
             * 3. EM is performed 10 times using the 10 folds the usual CV way.
             * 4. the loglikelihood is averaged over all 10 results.
             * 5. if loglikelihood has increased the number of clusters is increased by 1 and the program continues at step 2.
             * The number of folds is fixed to 10, as long as the number of instances in the training set is not smaller 10. If this is the case the number of folds is set equal to the number of instances.
             * Valid options are:
             *  -N <num>
             *    number of clusters. If omitted or -1 specified, then cross validation is used to select the number of clusters.
             * -I <num>
             *     max iterations.(default 100)
             * -V
             *   verbose.
             * -M <num>
             *   minimum allowable standard deviation for normal density  computation
             *    (default 1e-6)
             * -O
             *   Display model in old format (good when there are many clusters)
             * -S <num>
             *   Random number seed.(default 100)
             */
            
            String [] options=new String[4];
            
            // maximum number of iterations
            options[0] = "-I";   
            options[1] = "100"; 
            // number of clusters
            options[2]="-N";  
            options[3]="2";  
              
            clusterer.setOptions(options);  
            clusterer.buildClusterer(dataCluster);
            //clusterer.buildClusterer(dataClusterer);  
          
            // evaluate clusterer  
            ClusterEvaluation eval = new ClusterEvaluation();  
            eval.setClusterer(clusterer);  
            eval.evaluateClusterer(data);  
          
            // print results  
            System.out.println("Number of instances: "+data.numInstances()+", number of attributes: "+data.numAttributes());
            System.out.println(eval.clusterResultsToString());  
          }  
        }
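As an aside on what the `-I`, `-M` and `-N` options control, the following is a minimal sketch of the expectation-maximisation loop itself, assuming one-dimensional data and exactly two Gaussian components. It is not Weka's implementation: the class `EmSketch`, the method `fit`, the starting values and the toy data are all hypothetical and chosen only for illustration. `maxIterations` plays the role of `-I`, `minStdDev` the role of `-M`, and the fixed two components correspond to `-N 2`.

```java
import java.util.Arrays;

// Toy EM fit of a two-component Gaussian mixture on 1-D data.
// Hypothetical illustration only; this is not how weka.clusterers.EM is written.
class EmSketch {

    // Density of a normal distribution with the given mean and standard deviation.
    static double gaussian(double x, double mean, double sd) {
        double z = (x - mean) / sd;
        return Math.exp(-0.5 * z * z) / (sd * Math.sqrt(2.0 * Math.PI));
    }

    // Returns {weight0, mean0, sd0, weight1, mean1, sd1} after maxIterations EM rounds.
    static double[] fit(double[] x, int maxIterations, double minStdDev) {
        int n = x.length;
        double min = Arrays.stream(x).min().getAsDouble();
        double max = Arrays.stream(x).max().getAsDouble();

        // Crude starting point (an assumption): equal weights, means at the extremes.
        double[] weight = {0.5, 0.5};
        double[] mean = {min, max};
        double startSd = Math.max((max - min) / 2.0, minStdDev);
        double[] sd = {startSd, startSd};
        double[][] resp = new double[n][2];

        for (int iter = 0; iter < maxIterations; iter++) {
            // E-step: probability that each point belongs to each component.
            for (int i = 0; i < n; i++) {
                double p0 = weight[0] * gaussian(x[i], mean[0], sd[0]);
                double p1 = weight[1] * gaussian(x[i], mean[1], sd[1]);
                double total = p0 + p1;
                resp[i][0] = total > 0 ? p0 / total : 0.5;
                resp[i][1] = 1.0 - resp[i][0];
            }
            // M-step: re-estimate weight, mean and standard deviation per component.
            for (int k = 0; k < 2; k++) {
                double mass = 0.0, weightedSum = 0.0;
                for (int i = 0; i < n; i++) {
                    mass += resp[i][k];
                    weightedSum += resp[i][k] * x[i];
                }
                if (mass == 0.0) continue; // empty component: keep its old parameters
                mean[k] = weightedSum / mass;
                double squared = 0.0;
                for (int i = 0; i < n; i++) {
                    double d = x[i] - mean[k];
                    squared += resp[i][k] * d * d;
                }
                // Clamp the standard deviation from below, like the -M option.
                sd[k] = Math.max(Math.sqrt(squared / mass), minStdDev);
                weight[k] = mass / n;
            }
        }
        return new double[] {weight[0], mean[0], sd[0], weight[1], mean[1], sd[1]};
    }

    public static void main(String[] args) {
        double[] x = {1.0, 1.2, 0.8, 1.1, 5.0, 5.2, 4.8, 5.1}; // made-up toy data
        System.out.println(Arrays.toString(fit(x, 100, 1e-6)));
    }
}
```

The per-cluster weights, means and standard deviations returned here are the same kind of quantities that the Weka output reports for each numeric attribute.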

    The EMAlg program reads the labor.arff file from the data folder under the Weka installation directory.

    The output is:

    The value of data.classIndex() is: 16
    Number of attributes read: 17.
    Number of instances: 57, number of attributes: 17
    
    EM
    ==
    
    Number of clusters: 2
    
    
                                     Cluster
    Attribute                              0       1
                                      (0.14)  (0.86)
    =================================================
    duration
      mean                             1.5702  2.2532
      std. dev.                        0.4953  0.6764
    
    wage-increase-first-year
      mean                             3.0708  3.9184
      std. dev.                        1.0028  1.3571
    
    wage-increase-second-year
      mean                             3.8141  3.9964
      std. dev.                        0.8153  1.0624
    
    wage-increase-third-year
      mean                             3.9133  3.9133
      std. dev.                        0.6522  0.6952
    
    cost-of-living-adjustment
      none                             7.0614 36.9386
      tcf                              1.3707  8.6293
      tc                               2.2872  6.7128
      [total]                         10.7192 52.2808
    working-hours
      mean                            39.4412 37.8196
      std. dev.                        0.8911  2.4268
    
    pension
      none                             6.4515  6.5485
      ret_allw                         2.3211  3.6789
      empl_contr                       1.9466 42.0534
      [total]                         10.7192 52.2808
    standby-pay
      mean                             6.7945  7.5462
      std. dev.                        1.4912   1.918
    
    shift-differential
      mean                             3.4074  5.1002
      std. dev.                        1.6629  3.4277
    
    education-allowance
      yes                               3.167   8.833
      no                               6.5522 42.4478
      [total]                          9.7192 51.2808
    statutory-holidays
      mean                             10.555 11.1788
      std. dev.                         0.572  1.2533
    
    vacation
      below_average                     4.657  21.343
      average                          4.0313 14.9687
      generous                         2.0309 15.9691
      [total]                         10.7192 52.2808
    longterm-disability-assistance
      yes                              2.9977 48.0023
      no                               6.7215  3.2785
      [total]                          9.7192 51.2808
    contribution-to-dental-plan
      none                             7.1218  3.8782
      half                             2.5419 34.4581
      full                             1.0556 13.9444
      [total]                         10.7192 52.2808
    bereavement-assistance
      yes                              5.7192 50.2808
      no                                    4       1
      [total]                          9.7192 51.2808
    contribution-to-health-plan
      none                             6.2887  3.7113
      half                             1.8752  9.1248
      full                             2.5554 39.4446
      [total]                         10.7192 52.2808
    Clustered Instances
    
    0       8 ( 14%)
    1      49 ( 86%)
    
    
    Log likelihood: -18.37167
    
    
    Class attribute: class
    Classes to Clusters:
    
      0  1  <-- assigned to cluster
      8 12 | bad
      0 37 | good
    
    Cluster 0 <-- bad
    Cluster 1 <-- good
    
    Incorrectly clustered instances :    12.0     21.0526 %
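The "Incorrectly clustered instances" figure can be reproduced from the classes-to-clusters matrix above. The sketch below assumes a simple majority rule: each cluster is labelled with its most frequent class, and every other instance in that cluster counts as an error. Weka's own search for the class-to-cluster mapping may differ in detail; for this matrix the majority rule yields the same labelling (cluster 0 as bad, cluster 1 as good). The class and method names are hypothetical.

```java
// Recomputes the classes-to-clusters error from a count matrix.
// Hypothetical helper using a majority rule; not Weka's ClusterEvaluation code.
class ClassesToClustersSketch {

    // counts[c][k] is the number of instances of class c assigned to cluster k.
    // Each cluster takes its most frequent class as its label; all remaining
    // instances in that cluster are counted as incorrectly clustered.
    static int incorrectlyClustered(int[][] counts) {
        int numClusters = counts[0].length;
        int errors = 0;
        for (int k = 0; k < numClusters; k++) {
            int clusterSize = 0;
            int best = 0;
            for (int[] classRow : counts) {
                clusterSize += classRow[k];
                best = Math.max(best, classRow[k]);
            }
            errors += clusterSize - best;
        }
        return errors;
    }

    // Total number of instances in the matrix.
    static int total(int[][] counts) {
        int sum = 0;
        for (int[] classRow : counts) {
            for (int c : classRow) {
                sum += c;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        // Rows: bad, good. Columns: cluster 0, cluster 1 (taken from the output above).
        int[][] counts = { {8, 12}, {0, 37} };
        int errors = incorrectlyClustered(counts);
        double percent = 100.0 * errors / total(counts);
        System.out.println(errors + " incorrectly clustered (" + percent + " %)");
    }
}
```

Here the 12 "bad" instances that landed in cluster 1 are the only errors, and 12 out of 57 instances is the 21.0526 % reported by Weka.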
  • Original article: https://www.cnblogs.com/AmatVictorialCuram/p/3655088.html