zoukankan      html  css  js  c++  java
  • hierarchy 在大数据上聚类的利弊

    Well, hierarchical clustering doesn't make that much sense for large datasets. It's actually mostly a textbook example in my opinion. The problem with hierarchical clustering is that it doesn't really build sensible clusters. It builds a dendrogram, but with 14000 objects the dendrogram becomes pretty much unusable. And very few implementations of hierarchical clustering have non-trivial methods to extract sensible clusters from the dendrogram. Plus, in the general case, hierarchical clustering is of complexityO(n^3) which makes it scale really bad to large datasets.

    DBSCAN technically does not need a distance matrix. In fact, when you use a distance matrix, it will beslow, as computing the distance matrix already is O(n^2). And even then, you can safe the O(n^2)memory cost for DBSCAN by computing the distances on the fly at the cost of computing distances twice each. DBSCAN visits each point once, so there is next to no benefit from using a distance matrix except the symmetry gain. And technically, you could do some neat caching tricks to even reduce that, since DBSCAN also just needs to know which objects are below the epsilon threshold. When the epsilon is chosen reasonably, managing the neighbor sets on the fly will use significantly less memory thanO(n^2) at the same CPU cost of computing the distance matrix.

    Any really good implementation of DBSCAN (it is spelled all uppercas

  • 相关阅读:
    js 中的 EventLoop
    线程的并发工具类
    xpath获取某个节点下的全部字节点的文本
    2020中国 .NET开发者大会精彩回顾:葡萄城高性能表格技术解读
    .NET 控件集 ComponentOne V2020.0 Update3 发布,正式支持 .NET 5
    log4net配置
    TP5.1 爬虫
    pip下载慢
    TP5.1 二维码生成
    composer插件集合
  • 原文地址:https://www.cnblogs.com/harveyaot/p/3408225.html
Copyright © 2011-2022 走看看