zoukankan      html  css  js  c++  java
  • Python for Data Science

    Chapter 4 - Clustering Models

    Segment 1 - K-means method

    Clustering and Classification Algorithms

    K-Means clustering: unsupervised clustering algorithm where you know how many clusters are appropriate

    K-Means Use Cases

    • Market Price and Cost Modeling
    • Insurance Claim Fraud Detection
    • Hedge Fund Classification
    • Customer Segmentation

    K-Means Clustering

    Predictions are based on the number of centroids present(K) and nearest mean values, given an Euclidean distance measurement between observations.

    When using K-means:

    • Scale your variables
    • Look at a scatterplot or the data table to estimate the appropriate number of centroids to use for the K parameter value

    Setting up for clustering analysis

    import numpy as np
    import pandas as pd
    
    import matplotlib.pyplot as plt
    
    import sklearn
    from sklearn.preprocessing import scale
    import sklearn.metrics as sm
    from sklearn.metrics import confusion_matrix, classification_report
    
    from sklearn.cluster import KMeans
    from mpl_toolkits.mplot3d import Axes3D
    from sklearn import datasets
    
    %matplotlib inline
    plt.figure(figsize=(7,4))
    
    <Figure size 504x288 with 0 Axes>
    
    
    
    
    <Figure size 504x288 with 0 Axes>
    
    iris = datasets.load_iris()
    
    X = scale(iris.data)
    y = pd.DataFrame(iris.target)
    varibale_names = iris.feature_names
    X[0:10]
    
    array([[-0.90068117,  1.01900435, -1.34022653, -1.3154443 ],
           [-1.14301691, -0.13197948, -1.34022653, -1.3154443 ],
           [-1.38535265,  0.32841405, -1.39706395, -1.3154443 ],
           [-1.50652052,  0.09821729, -1.2833891 , -1.3154443 ],
           [-1.02184904,  1.24920112, -1.34022653, -1.3154443 ],
           [-0.53717756,  1.93979142, -1.16971425, -1.05217993],
           [-1.50652052,  0.78880759, -1.34022653, -1.18381211],
           [-1.02184904,  0.78880759, -1.2833891 , -1.3154443 ],
           [-1.74885626, -0.36217625, -1.34022653, -1.3154443 ],
           [-1.14301691,  0.09821729, -1.2833891 , -1.44707648]])
    

    Building and running your model

    clustering = KMeans(n_clusters=3, random_state=5)
    
    clustering.fit(X)
    
    KMeans(n_clusters=3, random_state=5)
    

    Plotting your model outputs

    iris_df = pd.DataFrame(iris.data)
    iris_df.columns = ['Sepal_Length','Sepal_Width','Petal_Length','Petal_Width']
    y.columns = ['Targets']
    
    color_theme = np.array(['darkgray','lightsalmon','powderblue'])
    
    plt.subplot(1,2,1)
    
    plt.scatter(x=iris_df.Petal_Length, y=iris_df.Petal_Width, c=color_theme[iris.target],s=50)
    plt.title('Ground Truth Classfication')
    
    plt.subplot(1,2,2)
    
    plt.scatter(x=iris_df.Petal_Length, y=iris_df.Petal_Width, c=color_theme[clustering.labels_],s=50)
    plt.title('K-Means Classfication')
    
    Text(0.5, 1.0, 'K-Means Classfication')
    

    ML04output_9_1

    relabel = np.choose(clustering.labels_, [2, 0, 1]).astype(np.int64)
    
    plt.subplot(1,2,1)
    
    plt.scatter(x=iris_df.Petal_Length, y=iris_df.Petal_Width, c=color_theme[iris.target],s=50)
    plt.title('Ground Truth Classfication')
    
    plt.subplot(1,2,2)
    
    plt.scatter(x=iris_df.Petal_Length, y=iris_df.Petal_Width, c=color_theme[relabel],s=50)
    plt.title('K-Means Classfication')
    
    Text(0.5, 1.0, 'K-Means Classfication')
    

    ML04output_10_1

    Evaluate your clustering results

    print(classification_report(y, relabel))
    
                  precision    recall  f1-score   support
    
               0       1.00      1.00      1.00        50
               1       0.74      0.78      0.76        50
               2       0.77      0.72      0.74        50
    
        accuracy                           0.83       150
       macro avg       0.83      0.83      0.83       150
    weighted avg       0.83      0.83      0.83       150
  • 相关阅读:
    三元操作符的类型务必一致
    a++ 和 ++a 的区别
    TCP/IP四层协议模型与ISO七层模型
    CentOS 7.0 使用 yum 安装 MariaDB 与 MariaDB 的简单配置
    修改gcc/g++默认include路径
    js中Math.random()生成指定范围数值的随机数
    在 QML 中创建 C++ 导入类型的实例
    QML从文件加载组件简单示例
    Android插件化开发---执行未安装apk中的Service
    游戏开发热门技术浅析
  • 原文地址:https://www.cnblogs.com/keepmoving1113/p/14319079.html
Copyright © 2011-2022 走看看