  • Unsupervised Classification

    [comment]: # Unsupervised Classification - Sprawl Classification Algorithm

    Idea

    Points (data) in the same cluster are near each other, or are connected to each other through other points in the cluster.
    So:

    • For a distance d, every point in a cluster can always find at least one other point of the same cluster within distance d.
    • Distances between points in different clusters are greater than the distance d.
      These conditions do not always hold, e.g. when clusters share common points the second condition fails.
      So some improvement is needed.

    Sprawl Classification Algorithm

    • Input:
      • data: Training Data
      • d: The minimum distance between clusters
      • minConnectedPoints: The minimum number of connected points
    • Output:
      • Result: an array of classified data
    • Logic:
    Load data into TotalCache.
    i = 0
    while (TotalCache.size > 0)
    {
        Pick any point A from TotalCache, put A into Cache2.
        Remove A from TotalCache.
        while (Cache2.size > 0)
        {
            In TotalCache, find the points 'nearPoints' that are less than d from any point in Cache2.
            Put Cache2 points into Cache1.
            Clear Cache2.
            Put nearPoints into Cache2.
            Remove nearPoints from TotalCache.
        }
        Add Cache1 points into Result[i].
        Clear Cache1.
        i++
    }
    Return Result
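
    The logic above can be sketched in Python as follows. This is a minimal sketch, not a definitive implementation: it assumes Euclidean distance, points given as numeric tuples, and that the expansion step repeats until no new near points are found. The function name `sprawl_classify` is made up for illustration, and `minConnectedPoints` is left out because the logic above does not use it.

```python
import math


def sprawl_classify(data, d):
    """Group points so that each point is within d of some point in its group.

    data: list of equal-length numeric tuples; d: distance threshold.
    Returns a list of clusters, each a list of points.
    """
    total_cache = list(data)  # points not yet assigned to any cluster
    result = []
    while total_cache:
        cache2 = [total_cache.pop()]  # frontier: neighbours not yet searched
        cache1 = []                   # points already confirmed for this cluster
        while cache2:
            near, rest = [], []
            for p in total_cache:
                # p joins the cluster if it is closer than d to any frontier point
                if any(math.dist(p, q) < d for q in cache2):
                    near.append(p)
                else:
                    rest.append(p)
            cache1.extend(cache2)
            cache2 = near
            total_cache = rest
        result.append(cache1)
    return result


points = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (5.0, 5.0), (5.4, 5.0)]
clusters = sprawl_classify(points, d=0.6)
print(len(clusters))  # 2
```

    The three points on the x axis end up in one cluster even though the two end points are 1.0 apart, because the middle point connects them.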
    

    Note: Because the algorithm calculates distances between points, the data may need to be normalized first so that each feature has the same weight.
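
    One possible way to do this is min-max scaling, which maps every feature to the range [0, 1]. The sketch below is an assumption for illustration only: the note does not say which normalization to use, and `min_max_normalize` is a hypothetical helper name.

```python
def min_max_normalize(data):
    """Rescale every feature (column) of data to the range [0, 1]."""
    columns = list(zip(*data))
    lows = [min(c) for c in columns]
    spans = [max(c) - min(c) for c in columns]
    return [
        # A constant feature has span 0; map it to 0.0 to avoid dividing by zero.
        tuple((x - lo) / span if span else 0.0
              for x, lo, span in zip(row, lows, spans))
        for row in data
    ]


scaled = min_max_normalize([(0, 10), (5, 20), (10, 30)])
print(scaled)  # [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]
```

    After scaling, the same d applies to every feature, so d has to be chosen relative to the [0, 1] range.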

    Improvement

    A big problem is that the method needs too many distance calculations between points. In the worst case every pair of points is compared once, which is $\frac{n(n - 1)}{2}$ calculations for n points.

    Improvement ideas:

    • Checking the distance for one feature first may be quicker.
      We do not need to calculate the real distance for each pair, because we only need to know whether the distance is less than $d$.
      For points x1, x2, the distance is greater than or equal to $d$ whenever there is a feature i with $\vert x1[i] - x2[i] \vert \geqslant d$.
    • Split the data into multiple areas.
      For a dataset with n dimensions (features), we can split the dataset into multiple smaller datasets, each lying in an n-dimensional space of side $d$ (size $d^{n}$).
      We can imagine each small space as an n-dimensional cube, with the cubes adjoining each other,
      so we only need to compare points in the current space and in its neighbouring spaces.
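
    Both ideas can be sketched together in Python. This is a rough sketch under assumptions: Euclidean distance, points as numeric tuples, and cubes of side d indexed by integer coordinates. The helper names `within_d`, `build_grid` and `neighbours_within_d` are hypothetical.

```python
import math
from collections import defaultdict
from itertools import product


def within_d(p, q, d):
    """True if the Euclidean distance between p and q is less than d."""
    # Cheap rejection: a gap of d or more on one feature already implies distance >= d.
    if any(abs(a - b) >= d for a, b in zip(p, q)):
        return False
    return math.dist(p, q) < d


def build_grid(data, d):
    """Bucket points into cubes of side d, keyed by integer cell coordinates."""
    grid = defaultdict(list)
    for p in data:
        grid[tuple(math.floor(x / d) for x in p)].append(p)
    return grid


def neighbours_within_d(p, grid, d):
    """Points closer than d to p, searching only p's cube and the adjacent cubes."""
    cell = tuple(math.floor(x / d) for x in p)
    found = []
    for offset in product((-1, 0, 1), repeat=len(cell)):
        key = tuple(c + o for c, o in zip(cell, offset))
        for q in grid.get(key, ()):
            if q != p and within_d(p, q, d):
                found.append(q)
    return found


points = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (5.0, 5.0), (5.4, 5.0)]
grid = build_grid(points, 0.6)
print(neighbours_within_d((5.0, 5.0), grid, 0.6))  # [(5.4, 5.0)]
```

    Two points closer than d differ by less than d on every feature, so their cube indices differ by at most 1 and the search cannot miss a neighbour. The search visits 3^n cubes per point, so this layout mainly pays off when n is small.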

    Cons

    • Needs a large amount of calculation.
    • Needs improvement to handle clusters that share common points.
  • Original article: https://www.cnblogs.com/steven-yang/p/5764718.html