  • Clustering of Multivariate Data (Clustering Multi-Source Data)

    I am about to cluster a dataset with about 15 columns, all of which are numeric (some are categorical codes, others are measurements). Some columns also have missing values. Can you give me pointers on how to go about it?

    I previously explored clustering with Weka, but I am not sure how Weka implements it, so I am going the R route.

    What I know: I am familiar with principal component analysis, at least in theory. But is it necessary whenever clustering on multiple columns? A link to a tutorial would go a long way, because Quick-R only covers the two-variable case.

    A sample of my dataset is listed below:

    1,64,9,30,33,2,3,1,6,1,5,-3.62,-3.71,-2.73,1
    2,61,4,30,33,2,3,2,7,4,4,-3.62,-3.71,-2.00,1
    3,49,4,18,21,2,3,2,8,17,18,-3.68,-3.88,-2.00,1
    4,40,4,10,12,2,2,2,24,20,23,-3.32,-3.42,-2.00,1
    5,43,9,10,12,2,2,1,2,1,29,-3.12,-3.19,-2.73,1
    6,52,9,16,19,2,3,2,35,34,35,-3.33,-3.26,-2.95,1
    7,46,4,15,18,2,3,2,8,40,42,-3.59,-3.50,-2.00,1
    8,40,4,10,12,2,2,2,24,20,46,-2.45,-2.69,-2.00,1
    

    I found this website that deals with the topic, but it has nothing on mixed categorical and continuous data: http://spss.me.holycross.edu/2011/01/13/multivariate-analysis-with-r/

    Answer:

    You should explode each categorical feature with n possible values (e.g. "color" can be "red", "purple" or "blue") into n boolean features (e.g. "color/red" with value 1.0 if "color" == "red" and 0.0 otherwise, and likewise for "color/purple" and "color/blue"). Then standardize all the features, i.e. both the boolean features that replace the categorical features and the numerical features. Then run k-means or any other clustering algorithm on the resulting data.
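
    As an illustration of the "explode into boolean features" step, here is a minimal Python sketch. The helper name `one_hot_encode` and the toy rows are made up for this example, and the set of possible values is assumed to be whatever appears in the data.

```python
def one_hot_encode(rows, categorical_columns):
    """Replace each categorical column with one 0.0/1.0 column per observed value."""
    # Collect the sorted set of values seen in each categorical column.
    levels = {
        col: sorted({row[col] for row in rows})
        for col in categorical_columns
    }
    encoded = []
    for row in rows:
        new_row = []
        for col, value in enumerate(row):
            if col in levels:
                # One boolean feature per possible value of this column.
                new_row.extend(
                    1.0 if value == level else 0.0 for level in levels[col]
                )
            else:
                # Numerical columns are passed through unchanged.
                new_row.append(float(value))
        encoded.append(new_row)
    return encoded


# Hypothetical toy data: column 0 is a measurement, column 1 is a categorical "color".
rows = [
    (64, "red"),
    (61, "purple"),
    (49, "blue"),
    (40, "red"),
]
encoded = one_hot_encode(rows, categorical_columns=[1])
# Columns are now: measurement, color/blue, color/purple, color/red.
```

    The same idea applies when the categories are stored as integer codes, as in the question: list those column indices in `categorical_columns` so the codes are treated as labels and not as magnitudes.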

    By standardizing the data I mean: center the data (remove the feature means) and scale to unit variance by dividing each feature value by the standard deviation of that feature across your samples.
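
    A rough sketch of that standardization in plain Python follows. Two choices here are assumptions and not part of the answer: the population standard deviation is used (the sample standard deviation would work equally well), and constant columns are mapped to 0.0 to avoid dividing by zero. The rows are three columns taken from the sample in the question.

```python
from statistics import fmean, pstdev


def standardize(rows):
    """Center each column on its mean and scale it to unit variance."""
    columns = list(zip(*rows))
    means = [fmean(column) for column in columns]
    stds = [pstdev(column) for column in columns]  # population std (assumption)
    return [
        [
            # A constant column has zero spread; map it to 0.0 instead of dividing by 0.
            (value - mean) / std if std > 0 else 0.0
            for value, mean, std in zip(row, means, stds)
        ]
        for row in rows
    ]


# Columns 2, 12 and 15 of the sample rows in the question.
rows = [
    (64, -3.62, 1),
    (61, -3.62, 1),
    (49, -3.68, 1),
    (40, -3.32, 1),
    (43, -3.12, 1),
    (52, -3.33, 1),
    (46, -3.59, 1),
    (40, -2.45, 1),
]
scaled = standardize(rows)
```

    Note that the last column is constant over the rows shown, so it carries no information for clustering and could simply be dropped.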

    Instead of naive feature-wise standardization, you could project your data onto its first principal components (truncated PCA, e.g. truncating so as to retain 95% of the variance while dropping components with very small singular values) and divide each transformed feature by its standard deviation (the corresponding singular value divided by sqrt(n - 1), where n is the number of samples) to get unit-variance features (whitening). This removes linear correlation among features in the original (boolean + numerical) feature space. I don't know whether linear correlation really hurts clustering in practice.
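
    The truncated PCA plus whitening idea could look roughly like the NumPy sketch below. The function name `pca_whiten` and its `variance_to_keep` parameter are made up for this example, and the 95% threshold is only the figure mentioned above. The sketch assumes the data has no missing values and is not constant.

```python
import numpy as np


def pca_whiten(X, variance_to_keep=0.95):
    """Project X onto its leading principal components and scale them to unit variance."""
    X = np.asarray(X, dtype=float)
    n_samples = X.shape[0]
    Xc = X - X.mean(axis=0)  # center each feature

    # SVD of the centered data: Xc = U @ diag(S) @ Vt, rows of Vt are the principal axes.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

    # Fraction of total variance explained by each component.
    explained = S**2 / np.sum(S**2)
    # Smallest number of components whose cumulative share reaches the threshold.
    k = int(np.searchsorted(np.cumsum(explained), variance_to_keep)) + 1
    k = min(k, len(S))

    projected = Xc @ Vt[:k].T  # coordinates along the first k components
    component_std = S[:k] / np.sqrt(n_samples - 1)  # std of each projected feature
    return projected / component_std  # whitening: unit variance, no linear correlation


# Columns 2, 4 and 5 of the sample rows in the question; the last two are strongly correlated.
sample = [
    (64, 30, 33),
    (61, 30, 33),
    (49, 18, 21),
    (40, 10, 12),
    (43, 10, 12),
    (52, 16, 19),
    (46, 15, 18),
    (40, 10, 12),
]
whitened = pca_whiten(sample)
```

    The sample covariance matrix of `whitened` is the identity up to rounding, which is what "unit variance with linear correlation removed" means here.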

  • Original source: https://www.cnblogs.com/xiangshancuizhu/p/2168966.html