  • Exploratory data analysis (EDA) with Python

    1. Get an overview of the data types

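    # `train` is assumed to be a pandas DataFrame that has already been loaded
    cols = [c for c in train.columns]   # collect the column names into a list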

    print('Number of features: {}'.format(len(cols)))

    print('Feature types:')
    train[cols].dtypes.value_counts()

    The output is:

        Number of features: 376
        Feature types:

        Out[5]:
        int64     368
        object      8
        dtype: int64
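
    The same overview can be tried on a small frame before running it on the full data. The sketch below is only an illustration: `toy` and its column names are made up, and `select_dtypes` is added here as one way to list which columns are non-numeric, something the counts alone do not show.

```python
import pandas as pd

# Hypothetical toy frame standing in for `train`; the column names are made up.
toy = pd.DataFrame({
    "X0": ["a", "b", "a"],   # string column (non-numeric)
    "X10": [0, 1, 0],        # integer column
    "X11": [0, 0, 0],        # integer column
})

# Count how many columns there are of each dtype.
dtype_counts = toy.dtypes.value_counts()
print(dtype_counts)

# List the names of the non-numeric columns.
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric_cols)
```

    On the real data, the non-numeric columns found this way would be the ones reported with dtype `object` above.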

    2. Inspect the value range of each feature

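    import numpy as np

    # counts[0]: constant columns, counts[1]: binary columns, counts[2]: everything else
    counts = [[], [], []]
    for c in cols:
        typ = train[c].dtype
        uniq = len(np.unique(train[c]))       # number of distinct values in this column
        if uniq == 1:                         # a single distinct value: the column is constant
            counts[0].append(c)
        elif uniq == 2 and typ == np.int64:   # two distinct integer values: usually a 0/1 binary flag
            counts[1].append(c)
        else:                                 # anything else is treated as categorical here
            counts[2].append(c)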

    print('Constant features: {} Binary features: {} Categorical features: {} '.format(*[len(c) for c in counts]))

    print('Constant features:', counts[0])
    print('Categorical features:', counts[2])

    The output is:

        Constant features: 12 Binary features: 356 Categorical features: 10

        Constant features: ['X11', 'X93', 'X107', 'X233', 'X235', 'X268', 'X289', 'X290', 'X293', 'X297', 'X330', 'X347']
        Categorical features: ['ID', 'y', 'X0', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X8']
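
    The same three-way split can also be written without an explicit loop. The following is a minimal sketch, not the original author's code: `toy` and its columns are hypothetical, and it uses `DataFrame.nunique()`, which ignores missing values by default, whereas `np.unique` would count them. It also shows one common follow-up, dropping the constant columns, since they carry no information.

```python
import pandas as pd

# Hypothetical toy frame; the column names are made up for illustration.
toy = pd.DataFrame({
    "X0": ["a", "b", "c", "a"],   # more than two distinct values
    "X10": [0, 1, 0, 1],          # two distinct integer values
    "X11": [0, 0, 0, 0],          # a single distinct value
})

n_unique = toy.nunique()   # distinct non-null values per column
is_int = toy.dtypes.map(pd.api.types.is_integer_dtype).astype(bool)

constant = n_unique[n_unique == 1].index.tolist()
binary = n_unique[(n_unique == 2) & is_int].index.tolist()
other = [c for c in toy.columns if c not in constant and c not in binary]

print("Constant:", constant)
print("Binary:", binary)
print("Other:", other)

# Constant columns cannot help a model, so they are often dropped.
reduced = toy.drop(columns=constant)
```

    As in the loop above, the "other" group is simply whatever is left over, so identifier or target columns end up there too and should be separated out before treating the group as categorical.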

    3. Plot the value distribution of each categorical feature

    import matplotlib.pyplot as plt
    import seaborn as sns

    pal = sns.color_palette()

    for c in counts[2]:
      value_counts = train[c].value_counts()
      fig, ax = plt.subplots(figsize=(10, 5))
      plt.title('Categorical feature {} - Cardinality {}'.format(c, len(np.unique(train[c]))))
      plt.xlabel('Feature value')
      plt.ylabel('Occurrences')
      plt.bar(range(len(value_counts)), value_counts.values, color=pal[1])
      ax.set_xticks(range(len(value_counts)))
      ax.set_xticklabels(value_counts.index, rotation='vertical')
      plt.show()
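
    The loop body can be factored into a small helper that draws on a given axes object, which makes it easier to reuse or to place several features in one figure. This is a sketch under assumptions: `plot_value_counts` and `toy` are hypothetical names, and the `Agg` backend is selected only so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd


def plot_value_counts(series, ax):
    """Draw one bar per distinct value of `series`, most frequent first."""
    counts = series.value_counts()
    ax.bar(range(len(counts)), counts.values)
    ax.set_xticks(range(len(counts)))
    ax.set_xticklabels(counts.index, rotation="vertical")
    ax.set_title("Categorical feature {} - Cardinality {}".format(series.name, len(counts)))
    ax.set_xlabel("Feature value")
    ax.set_ylabel("Occurrences")
    return counts


# Hypothetical toy column standing in for one categorical feature of `train`.
toy = pd.DataFrame({"X0": ["a", "b", "a", "c", "a", "b"]})
fig, ax = plt.subplots(figsize=(10, 5))
counts = plot_value_counts(toy["X0"], ax)
fig.tight_layout()
```

    Because `value_counts()` sorts by frequency, the bars run from the most common value to the rarest, which makes long-tailed features easy to spot.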

     

  • Original article: https://www.cnblogs.com/gczr/p/7084251.html