  • Text Mining: Text Clustering (OPTICS)

    Liu Yong  Email: lyssym@sina.com

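      DBSCAN is sensitive to its two input parameters, the neighborhood radius E and the density threshold M, which makes parameter tuning tedious. This post therefore studies another density-based clustering algorithm, OPTICS (Ordering Points To Identify the Clustering Structure). OPTICS is an improvement on DBSCAN and is much less sensitive to the input parameters. In addition, OPTICS does not produce clusters explicitly: it only orders the objects in the data set and yields an ordered list that carries enough information to extract clusters. In practice, this ordered list can be used to analyze how the data is distributed and how the objects relate to one another.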

    Basic Concepts

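      Because OPTICS is an improvement on DBSCAN, the two share many concepts, such as core object, (directly) density-reachable, and density-connected; see DBSCAN for details. On top of these, this post introduces two key concepts.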

      (1) Core distance

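      In a data set D, given the parameters E and M, the core distance of p is the smallest neighborhood radius that makes p a core object. Formally:

      $$
      cd(p) =
      \begin{cases}
        \text{UNDEFINED}, & |N_E(p)| < M \\
        d\bigl(p, N_E^{M}(p)\bigr), & |N_E(p)| \ge M
      \end{cases}
      $$

      where N_E(p) is the E-neighborhood of p, N_E^M(p) is the M-th nearest object to p inside that neighborhood, and d is the distance measure.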

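      Informally, given E and M, the core distance of p is the M-th smallest of the distances from p to its neighbors (or the M-th largest value when the measure is a similarity rather than a distance). The measure can be Euclidean distance, cosine similarity, a Word2Vec-based similarity, and so on.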
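
      The definition above can be sketched in a few lines of Java. This is a minimal illustration, not the code used later in this post: it assumes a true distance (smaller means closer), that the list of distances from p includes p itself at distance 0, and that -1 stands for UNDEFINED. The class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class CoreDistanceSketch {
    // Sentinel for "p is not a core object" (an assumption of this sketch).
    static final double UNDEFINED = -1.0;

    /**
     * distances: distance from p to every object in the data set,
     * including p itself (distance 0).
     */
    static double coreDistance(List<Double> distances, double eps, int minPts) {
        // Keep only the objects inside the eps-neighborhood of p.
        List<Double> inNeighborhood = new ArrayList<>();
        for (double d : distances) {
            if (d <= eps) {
                inNeighborhood.add(d);
            }
        }
        // Fewer than minPts neighbors: p is not a core object.
        if (inNeighborhood.size() < minPts) {
            return UNDEFINED;
        }
        // The core distance is the minPts-th smallest distance.
        Collections.sort(inNeighborhood);
        return inNeighborhood.get(minPts - 1);
    }
}
```

      For a similarity measure such as cosine similarity the comparisons flip: neighbors are the objects with similarity of at least the threshold, and the M-th largest value is taken.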

      (2) Reachability distance

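      In a data set D, given the parameters E and M, the reachability distance of an object o with respect to p is the larger of two values: the core distance of p, and the distance between p and o. Formally:

      $$
      rd(o, p) =
      \begin{cases}
        \text{UNDEFINED}, & |N_E(p)| < M \\
        \max\bigl(cd(p),\, d(p, o)\bigr), & |N_E(p)| \ge M
      \end{cases}
      $$

      where cd(p) denotes the core distance of p, d(p, o) the distance between p and o, and N_E(p) the E-neighborhood of p.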
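
      As a small illustration of this definition, the sketch below computes the reachability distance from a precomputed core distance. It again assumes a true distance (smaller means closer) and uses -1 for UNDEFINED; the class and method names are hypothetical.

```java
final class ReachabilityDistanceSketch {
    // Sentinel for "undefined" (an assumption of this sketch).
    static final double UNDEFINED = -1.0;

    /**
     * coreDistOfP: core distance of p, or UNDEFINED if p is not a core object.
     * distPO:      distance between p and o.
     */
    static double reachabilityDistance(double coreDistOfP, double distPO) {
        // o is only reachable from p if p is a core object.
        if (coreDistOfP == UNDEFINED) {
            return UNDEFINED;
        }
        // Objects closer to p than its core distance all share the core distance.
        return Math.max(coreDistOfP, distPO);
    }
}
```

      With a core distance of 0.2, an object at distance 0.1 gets reachability distance 0.2, while an object at distance 0.35 gets 0.35.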

      Pseudocode (adapted from Wikipedia):

    OPTICS(DB, eps, MinPts)
        for each point p of DB
           p.reachability-distance = UNDEFINED
        for each unprocessed point p of DB
           N = getNeighbors(p, eps)
           mark p as processed
           output p to the ordered list
           if (core-distance(p, eps, MinPts) != UNDEFINED)
              Seeds = empty priority queue
              update(N, p, Seeds, eps, MinPts)
              for each next q in Seeds
                 N' = getNeighbors(q, eps)
                 mark q as processed
                 output q to the ordered list
                 if (core-distance(q, eps, MinPts) != UNDEFINED)
                    update(N', q, Seeds, eps, MinPts)


    update(N, p, Seeds, eps, MinPts)
        coredist = core-distance(p, eps, MinPts)
        for each o in N
           if (o is not processed)
              new-reach-dist = max(coredist, dist(p,o))
              if (o.reachability-distance == UNDEFINED) // o is not in Seeds
                  o.reachability-distance = new-reach-dist
                  Seeds.insert(o, new-reach-dist)
              else               // o in Seeds, check for improvement
                  if (new-reach-dist < o.reachability-distance)
                     o.reachability-distance = new-reach-dist
                     Seeds.move-up(o, new-reach-dist)

      Source code:

    import java.util.List;

    import com.gta.cosine.ElementDict;

    /** A document to be clustered, stored as its list of tokenized terms. */
    public class DataPoint {
        private final List<ElementDict> terms;
        private double initDistance;       // similarity to the point whose neighborhood was queried last
        private double coreDistance;       // -1 means undefined
        private double reachableDistance;  // -1 means undefined
        private boolean isVisited;


        public DataPoint(List<ElementDict> terms) {
            this.terms = terms;
            this.initDistance = -1;
            this.coreDistance = -1;
            this.reachableDistance = -1;
            this.isVisited = false;
        }


        public double getCoreDistance() {
            return coreDistance;
        }


        public void setCoreDistance(double coreDistance) {
            this.coreDistance = coreDistance;
        }


        public double getReachableDistance() {
            return reachableDistance;
        }


        public void setReachableDistance(double reachableDistance) {
            this.reachableDistance = reachableDistance;
        }


        public boolean getIsVisitLabel() {
            return isVisited;
        }


        public void setIsVisitLabel(boolean isVisited) {
            this.isVisited = isVisited;
        }


        public double getInitDistance() {
            return initDistance;
        }


        public void setInitDistance(double initDistance) {
            this.initDistance = initDistance;
        }


        public List<ElementDict> getAllElements() {
            return terms;
        }


        public ElementDict getElement(int index) {
            return terms.get(index);
        }


        // Compares two points term by term. Note that this overloads, rather than
        // overrides, Object.equals(Object), so collections such as List and
        // PriorityQueue still compare DataPoint objects by identity.
        public boolean equals(DataPoint dp) {
            List<ElementDict> ed1 = getAllElements();
            List<ElementDict> ed2 = dp.getAllElements();
            int len = ed1.size();

            if (len != ed2.size()) {
                return false;
            }

            for (int i = 0; i < len; i++) {
                if (!ed1.get(i).equals(ed2.get(i))) {
                    return false;
                }
            }
            return true;
        }

    }

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.PriorityQueue;
    import java.util.Queue;

    import com.gta.cosine.ElementDict;
    import com.gta.cosine.TextCosine;

    public class OPTICS {
        private static final double UNDEFINED = -1;

        private double            eps;      // minimum cosine similarity for two points to be neighbors
        private int               minPts;
        private TextCosine        cosine;
        private List<DataPoint>   dataPoints;
        private List<DataPoint>   orderList;

        public OPTICS(double eps, int minPts) {
            this.eps = eps;
            this.minPts = minPts;
            this.cosine = new TextCosine();
            this.dataPoints = new ArrayList<>();
            this.orderList = new ArrayList<>();
        }


        public void addPoint(String s) {
            List<ElementDict> ed = cosine.tokenizer(s);
            dataPoints.add(new DataPoint(ed));
        }


        // Core "distance" of the point whose neighbors are given: the minPts-th
        // largest similarity, or UNDEFINED if there are fewer than minPts neighbors.
        // Sorts the list in place.
        public double coreDistance(List<DataPoint> neighbors) {
            double ret = UNDEFINED;
            if (neighbors.size() >= minPts) {
                // Descending by similarity. Double.compare returns 0 for ties, which
                // the Comparator contract requires (the original never returned 0).
                Collections.sort(neighbors,
                        (dp1, dp2) -> Double.compare(dp2.getInitDistance(), dp1.getInitDistance()));
                ret = neighbors.get(minPts - 1).getInitDistance();
            }
            return ret;
        }


        // Despite the name, this is a cosine similarity: larger means closer.
        public double cosineDistance(DataPoint p, DataPoint q) {
            List<ElementDict> vec1 = p.getAllElements();
            List<ElementDict> vec2 = q.getAllElements();
            return cosine.analysisText(vec1, vec2);
        }


        // Returns every point whose similarity to p is at least eps and records
        // that similarity on the neighbor (overwriting the previous query's value).
        public List<DataPoint> getNeighbors(DataPoint p, List<DataPoint> points) {
            List<DataPoint> neighbors = new ArrayList<>();
            for (DataPoint q : points) {
                double similarity = cosineDistance(p, q);
                if (similarity >= eps) {
                    q.setInitDistance(similarity);
                    neighbors.add(q);
                }
            }
            return neighbors;
        }


        public void cluster(List<DataPoint> points) {
            for (DataPoint point : points) {
                if (point.getIsVisitLabel()) {
                    continue;
                }
                List<DataPoint> neighbors = getNeighbors(point, points);
                point.setIsVisitLabel(true);
                orderList.add(point);
                double cd = coreDistance(neighbors);
                if (cd == UNDEFINED) {
                    continue;
                }
                point.setCoreDistance(cd);
                // The head of the queue is the point with the largest reachability
                // similarity, i.e. the closest one.
                Queue<DataPoint> seeds = new PriorityQueue<>(16,
                        (dp1, dp2) -> Double.compare(dp2.getReachableDistance(), dp1.getReachableDistance()));

                update(point, neighbors, seeds, orderList);
                while (!seeds.isEmpty()) {
                    DataPoint q = seeds.poll();
                    List<DataPoint> newNeighbors = getNeighbors(q, points);
                    q.setIsVisitLabel(true);
                    orderList.add(q);
                    double qcd = coreDistance(newNeighbors);
                    if (qcd != UNDEFINED) {
                        q.setCoreDistance(qcd);  // the original never stored this value
                        update(q, newNeighbors, seeds, orderList);
                    }
                }
            }
        }


        public void update(DataPoint p, List<DataPoint> neighbors, Queue<DataPoint> seeds, List<DataPoint> seqList) {
            double coreDistance = coreDistance(neighbors);
            for (DataPoint point : neighbors) {
                double similarity = cosineDistance(p, point);
                // The measure is a similarity (larger means closer), so the max(...)
                // of the distance-based definition becomes a min(...), and a larger
                // value counts as an improvement. The original used max and ">",
                // which is inconsistent with the descending sort and queue order.
                double reachableDistance = Math.min(coreDistance, similarity);
                if (!point.getIsVisitLabel()) {
                    if (point.getReachableDistance() == UNDEFINED) {
                        point.setReachableDistance(reachableDistance);
                        seeds.add(point);
                    } else if (reachableDistance > point.getReachableDistance()) {
                        // Remove and re-insert so that the queue re-orders the point.
                        if (seeds.remove(point)) {
                            point.setReachableDistance(reachableDistance);
                            seeds.add(point);
                        }
                    }
                } else if (point.getReachableDistance() == UNDEFINED) {
                    // Author's modification: a point that was already output but never
                    // received a reachability value is moved back into the queue.
                    point.setReachableDistance(reachableDistance);
                    if (seqList.remove(point)) {
                        seeds.add(point);
                    }
                }
            }
        }


        public void showCluster() {
            for (DataPoint point : orderList) {
                List<ElementDict> ed = point.getAllElements();
                for (ElementDict e : ed) {
                    System.out.print(e.getTerm() + "  ");
                }
                System.out.println();
                System.out.println("core:  " + point.getCoreDistance());
                System.out.println("reach: " + point.getReachableDistance());
                System.out.println("***************************************");
            }
        }


        public void analysis() {
            cluster(dataPoints);
            showCluster();
        }


        // Position of the first point in the queue's iteration order whose terms
        // equal those of o, or -1 if there is none. Not used by the code above.
        public int IndexOfList(DataPoint o, Queue<DataPoint> points) {
            int index = 0;
            for (DataPoint p : points) {
                if (o.equals(p)) {
                    return index;
                }
                index++;
            }
            return -1;
        }

    }

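      Distances in this post are computed with cosine similarity; see "Text Mining: Text Similarity" in this series for details. In addition, analysis showed that some objects, a border object for example, may already have been visited while their core distance still holds its initial value. If the pseudocode is followed strictly, the result then differs somewhat from DBSCAN's. The author therefore modified OPTICS slightly so that the reachability distance of such objects can still be updated and the objects are added to the list again; overall, the result is then consistent with DBSCAN's. The author believes this change is effective and to some extent necessary. If you have a better solution, please get in touch.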
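
      For readers without the companion post at hand, here is a minimal sketch of cosine similarity over term-frequency vectors. It is not the TextCosine class imported above, whose internals are not shown in this post; the class name, method names, and the use of raw term counts as weights are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

final class CosineSimilaritySketch {

    /** Counts how often each token occurs (raw term frequency). */
    static Map<String, Integer> termFrequencies(String[] tokens) {
        Map<String, Integer> tf = new HashMap<>();
        for (String t : tokens) {
            tf.merge(t, 1, Integer::sum);
        }
        return tf;
    }

    /** Euclidean length of a term-frequency vector. */
    static double norm(Map<String, Integer> v) {
        double sum = 0;
        for (int c : v.values()) {
            sum += (double) c * c;
        }
        return Math.sqrt(sum);
    }

    /** Cosine of the angle between two term-frequency vectors, in [0, 1]. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        double na = norm(a);
        double nb = norm(b);
        // An empty document has no direction; treat it as dissimilar to everything.
        if (na == 0 || nb == 0) {
            return 0;
        }
        return dot / (na * nb);
    }
}
```

      Because the result is a similarity, larger values mean closer documents, which is why the neighborhood test in the OPTICS class uses ">= eps" rather than "<= eps".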

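      Because OPTICS produces an ordered list of objects, which is valuable for subsequent data analysis and mining, the algorithm can serve as a preprocessing step. However, it has to maintain a priority queue, which costs some efficiency.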
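
      One common use of the ordered list is to cut it at a threshold and read off DBSCAN-style clusters. The sketch below follows the extraction procedure described in the original OPTICS paper, not anything implemented in this post. It assumes distance semantics (smaller means closer, unlike the similarity-based code above), -1 for UNDEFINED, and a threshold epsPrime no larger than the eps used to build the ordering; all names are hypothetical.

```java
final class ExtractClustersSketch {
    static final double UNDEFINED = -1.0;  // sentinel, an assumption of this sketch
    static final int NOISE = -1;

    /**
     * reach[i] and core[i] are the reachability and core distance of the i-th
     * object in the OPTICS output order. Returns one cluster label per object.
     */
    static int[] extract(double[] reach, double[] core, double epsPrime) {
        int[] labels = new int[reach.length];
        int current = NOISE;
        int next = 0;
        for (int i = 0; i < reach.length; i++) {
            boolean reachable = reach[i] != UNDEFINED && reach[i] <= epsPrime;
            if (reachable) {
                // Density-reachable from an earlier object: same cluster.
                labels[i] = current;
            } else if (core[i] != UNDEFINED && core[i] <= epsPrime) {
                // Not reachable, but a core object itself: start a new cluster.
                current = next++;
                labels[i] = current;
            } else {
                labels[i] = NOISE;
            }
        }
        return labels;
    }
}
```

      Running the extraction with several values of epsPrime on the same ordering gives clusterings at different density levels without re-running OPTICS.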


      Author: 志青云集
      Source: http://www.cnblogs.com/lyssym
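      Copyright of this article is shared by the author and cnblogs (博客园). Reposting is welcome, but unless the author agrees otherwise, this notice must be retained and a link to the original article must be placed prominently on the page.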


  • Original article: https://www.cnblogs.com/lyssym/p/4950843.html