zoukankan      html  css  js  c++  java
  • Pandas系列(十三)-其他常用功能

    一、统计数据频率

     1. values_counts

    pd.value_counts(df.column_name)
    df.column_name.value_counts()
    
    Series.value_counts(normalize=False, sort=True, ascending=False, bins=None, dropna=True)[source]
    Return a Series containing counts of unique values.

      参数详解

    normalize : boolean, default False
    If True then the object returned will contain the relative frequencies of the unique values.
    
    sort : boolean, default True
    Sort by values.
    
    ascending : boolean, default False
    Sort in ascending order.
    
    bins : integer, optional
    Rather than count values, group them into half-open bins, a convenience for pd.cut, only works with numeric data.
    
    dropna : boolean, default True
    Don’t include counts of NaN.
    

      参数示例讲解

    index = pd.Index([3, 1, 2, 3, 4, np.nan])
    index.value_counts()
    Out[144]: 
    3.0    2
    4.0    1
    2.0    1
    1.0    1
    dtype: int64
    index.value_counts(normalize=True)
    Out[145]: 
    3.0    0.4
    4.0    0.2
    2.0    0.2
    1.0    0.2
    dtype: float64
    index.value_counts(bins=3)
    Out[146]: 
    (2.0, 3.0]      2
    (0.996, 2.0]    2
    (3.0, 4.0]      1
    dtype: int64
    index.value_counts(dropna=False)
    Out[148]: 
     3.0    2
    NaN     1
     4.0    1
     2.0    1
     1.0    1
    dtype: int64 
    In [21]:  data=pd.DataFrame(pd.Series([1,2,3,4,5,6,11,1,1,1,1,2,2,2,2,3]).values.reshape(4,4),columns=['a','b','c','d'])
    
    In [22]: data
    Out[22]: 
       a  b   c  d
    0  1  2   3  4
    1  5  6  11  1
    2  1  1   1  2
    3  2  2   2  3
    
    In [23]: pd.value_counts(data.a)
    Out[23]: 
    1    2
    2    1
    5    1
    Name: a, dtype: int64
    
    In [26]: pd.value_counts(data.a).sort_index()
    Out[26]: 
    1    2
    2    1
    5    1
    Name: a, dtype: int64
    
    In [27]: pd.value_counts(data.a).sort_index().index
    Out[27]: Int64Index([1, 2, 5], dtype='int64')
    
    In [28]: pd.value_counts(data.a).sort_index().values
    Out[28]: array([2, 1, 1], dtype=int64)
    values_count实例

     2.类别中属性的个数

    # 方式一
    cat_uniques = []
    for cat in cat_features:
        cat_uniques.append(len(train[cat].unique()))
    uniq_values_in_categories = pd.DataFrame.from_items([('cat_name', cat_features), ('unique_values', cat_uniques)])
    
    # 方式二
    list(map(lambda x: len(train[x]),cat_featrues))
    

     3.唯一值

    二、数据去重

        def drop_duplicates(self, subset=None, keep='first', inplace=False):
            """
            Return DataFrame with duplicate rows removed, optionally only
            considering certain columns
    
            Parameters
            ----------
            subset : column label or sequence of labels, optional
                Only consider certain columns for identifying duplicates, by
                default use all of the columns
            keep : {'first', 'last', False}, default 'first'
                - ``first`` : Drop duplicates except for the first occurrence.
                - ``last`` : Drop duplicates except for the last occurrence.
                - False : Drop all duplicates.
            inplace : boolean, default False
                Whether to drop duplicates in place or to return a copy
    
            Returns
            -------
            deduplicated : DataFrame
            """
    

    1. duplicated()方法
    • 查看列是否重复
    dataframe.colname.duplicated()
    • 查看整行是否重复
    dataframe.duplicated()
    • 查看subset是否重复
    dataframe.duplicated(subset = [])
    1. drop_duplicats()方法
      用于丢弃重复项
    dataframe.drop_duplicats()
    1. 参数keep
      keep可以为first和last,表示是选择最前一项还是最后一项保留。
    dataframe.duplicated(keep = "first")
    dataframe.duplicated(keep = "last")

    也可以设置布尔类型,当设为False时候,重复项将都被显示。

    dataframe.duplicated(keep = "False")

     三、数据类型转换

    四、聚合方法

    import pandas as pd
    from matplotlib import pyplot as plt
    import numpy as np
    pd.set_option('display.max_columns',None)
    df = pd.read_csv('911.csv')
    print(df.head(1))
             lat        lng                                               desc  
    0  40.297876 -75.581294  REINDEER CT & DEAD END;  NEW HANOVER; Station ...   
           zip                   title            timeStamp          twp  
    0  19525.0  EMS: BACK PAINS/INJURY  2015-12-10 17:10:52  NEW HANOVER   
                         addr  e  
    0  REINDEER CT & DEAD END  1  
    df.info()
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 249737 entries, 0 to 249736
    Data columns (total 9 columns):
    lat          249737 non-null float64
    lng          249737 non-null float64
    desc         249737 non-null object
    zip          219391 non-null float64
    title        249737 non-null object
    timeStamp    249737 non-null object
    twp          249644 non-null object
    addr         249737 non-null object
    e            249737 non-null int64
    dtypes: float64(3), int64(1), object(5)
    memory usage: 17.1+ MB
    #获取分类
    temp_list = df.title.str.split(':').tolist()
    cate_list = list(set([i[0] for i in temp_list]))
    cate_list
    Out[152]: ['Fire', 'Traffic', 'EMS']
    #构造全为0的数组
    zeros_df = pd.DataFrame(np.zeros((df.shape[0],len(cate_list))),columns=cate_list)
    #赋值
    for cate in cate_list:
        zeros_df[cate][df.title.str.contains(cate)] = 1
    print(zeros_df)
            Fire  Traffic  EMS
    0        0.0      0.0  1.0
    1        0.0      0.0  1.0
    2        1.0      0.0  0.0
    3        0.0      0.0  1.0
    4        0.0      0.0  1.0
    5        0.0      0.0  1.0
    6        0.0      0.0  1.0
    7        0.0      0.0  1.0
    8        0.0      0.0  1.0
    9        0.0      1.0  0.0
    10       0.0      1.0  0.0
    11       0.0      1.0  0.0
    12       0.0      1.0  0.0
    13       0.0      1.0  0.0
    14       0.0      1.0  0.0
    15       0.0      1.0  0.0
    16       0.0      0.0  1.0
    17       0.0      0.0  1.0
    18       0.0      0.0  1.0
    19       0.0      1.0  0.0
    20       0.0      1.0  0.0
    21       0.0      1.0  0.0
    22       1.0      0.0  0.0
    23       0.0      1.0  0.0
    24       0.0      1.0  0.0
    25       0.0      0.0  1.0
    26       0.0      0.0  1.0
    27       1.0      0.0  0.0
    28       0.0      1.0  0.0
    29       0.0      1.0  0.0
          ...      ...  ...
    249707   0.0      1.0  0.0
    249708   1.0      0.0  0.0
    249709   0.0      0.0  1.0
    249710   0.0      1.0  0.0
    249711   0.0      1.0  0.0
    249712   0.0      0.0  1.0
    249713   1.0      0.0  0.0
    249714   1.0      0.0  0.0
    249715   0.0      1.0  0.0
    249716   0.0      0.0  1.0
    249717   0.0      0.0  1.0
    249718   1.0      0.0  0.0
    249719   0.0      0.0  1.0
    249720   0.0      0.0  1.0
    249721   0.0      0.0  1.0
    249722   0.0      1.0  0.0
    249723   0.0      0.0  1.0
    249724   0.0      0.0  1.0
    249725   0.0      0.0  1.0
    249726   1.0      0.0  0.0
    249727   1.0      0.0  0.0
    249728   0.0      1.0  0.0
    249729   0.0      0.0  1.0
    249730   0.0      0.0  1.0
    249731   0.0      1.0  0.0
    249732   0.0      0.0  1.0
    249733   0.0      0.0  1.0
    249734   0.0      0.0  1.0
    249735   1.0      0.0  0.0
    249736   0.0      1.0  0.0
    [249737 rows x 3 columns]
    sum_ret = zeros_df.sum(axis=0)
    print(sum_ret)
    Fire        37432.0
    Traffic     87465.0
    EMS        124844.0
    dtype: float64
    不同类型紧急情况统计
    df = pd.read_csv('911.csv')
    # print(df.head(1))
    # print(df.info())
    #获取分类
    temp_list = df.title.str.split(':').tolist()
    cate_list = [i[0] for i in temp_list]
    df['cate'] = pd.DataFrame(np.array(cate_list).reshape(df.shape[0],1))
    print(df.groupby(by='cate').count()['title'])
    cate
    EMS        124840
    Fire        37432
    Traffic     87465
    Name: title, dtype: int64
    实例2

    五、分布分析

    pd.cut(data['col_names'], bins, labels=None)
    

     实例

    import numpy
    import pandas
    from pandas import read_csv
    data = read_csv('E:/python/data_analysis/data/dis-cut.csv')
    bins = [data['年龄'].min()-1,20,30,40,max(data.年龄)+1]
    labels = ['20岁及以下','21岁到30岁','31岁到40岁','41岁以上']
    age_cut = pandas.cut(data.年龄,bins,labels=labels)
    data['年龄分层'] = age_cut
    result = data.groupby(by=['年龄分层'])['年龄'].agg(['size','mean'])
    result.rename(columns= {'size': '人数','mean': '平均年龄'})
    Out[171]: 
                人数       平均年龄
    年龄分层                     
    20岁及以下    2061  19.302280
    21岁到30岁  46858  25.759081
    31岁到40岁   8729  33.095773
    41岁以上     1453  50.625602
    

     六、交叉分析

    import pandas
    from pandas import read_csv
    data = read_csv('E:/python/data_analysis/data/pivot_table.csv')
    bins = [data['年龄'].min() - 1, 20, 30, 40, max(data.年龄) + 1]
    labels = ['20岁及以下', '21岁到30岁', '31岁到40岁', '41岁以上']
    age_cut = pandas.cut(data.年龄, bins, labels=labels)
    data['年龄分层'] = age_cut
    r1 = data.pivot_table(
        values=['年龄'],
        index=['年龄分层'],
        columns=['性别'],
        aggfunc=[numpy.size, numpy.mean]
    )
    r2 = data.pivot_table(
        values=['年龄'],
        index=['年龄分层'],
        columns=['性别'],
        aggfunc=[numpy.std],
    )
    print(r1.index)
    print(r1['size']['年龄'][''])
    print(r1.join(r2))    
    CategoricalIndex(['41岁以上', '21岁到30岁', '31岁到40岁', '20岁及以下'], categories=['20岁及以下', '21岁到30岁', '31岁到40岁', '41岁以上'], ordered=True, name='年龄分层', dtype='category')
    年龄分层
    41岁以上       111
    21岁到30岁    2903
    31岁到40岁     735
    20岁及以下      567
    Name: 女, dtype: int64
             size              mean                  std          
               年龄                年龄                   年龄          
    性别          女      男          女          男         女         男
    年龄分层                                                          
    41岁以上     111   1950  18.972973  19.321026  1.708053  1.044185
    21岁到30岁  2903  43955  25.954874  25.746149  2.453642  2.361298
    31岁到40岁   735   7994  33.213605  33.084939  2.316704  2.200319
    20岁及以下    567    886  51.691358  49.943567  6.761848  7.914171
    交叉分析pivot_table案例
    import pandas as pd
    import numpy as np
    data = pd.DataFrame({'Sample': range(1, 11), 'Gender': ['Female', 'Male', 'Female', 'Male', 'Male', 'Male', 'Female', 'Female', 'Male', 'Female'], 
                        'Handedness': ['Right-handed', 'Left-handed', 'Right-handed', 'Right-handed', 'Left-handed', 'Right-handed', 'Right-handed', 'Left-handed', 'Right-handed', 'Right-handed']})
    #假设我们想要根据性别和用手习惯对这段数据进行#统计汇总。虽然可以用pivot_table()实现该功#能,但是pandas.crosstab()函数会更方便:
    # 方法一:用pivot_table
    # 其实我觉的一点都不麻烦ε=(´ο`*)))唉
    data.pivot_table(index=['Gender'], columns='Handedness', aggfunc=len, margins=True)
    Out[173]: 
                    Sample                 
    Handedness Left-handed Right-handed All
    Gender                                 
    Female               1            4   5
    Male                 2            3   5
    All                  3            7  10
    # 方法二:用crosstab
    pd.crosstab(data.Gender, data.Handedness, margins=True)
    Out[174]: 
    Handedness  Left-handed  Right-handed  All
    Gender                                    
    Female                1             4    5
    Male                  2             3    5
    All                   3             7   10
    交叉分析crosstab案例
    • 透视表pivot_table()是一种进行分组统计的函数,参数aggfunc决定统计类型;
    • 交叉表crosstab()是一种特殊的pivot_table(),专用于计算分组频率。

    具体使用参照 https://www.cnblogs.com/onemorepoint/p/8425300.html

     七、格式化输出

    pandas dataframe数据全部输出,数据太多也不用省略号表示。

    pd.set_option('display.max_columns',None)

    或者

    with option_context('display.max_rows', 10, 'display.max_columns', 5):
        代码逻辑
    

     八、索引

      

  • 相关阅读:
    Java Class/Method
    Data Types in Java
    Java overview && JVM
    Interface Vs. Abstract Class
    【javascript】Object三种定义方式
    【C#】导出Excel
    【javascript】简单原型链、借用构造函数
    Mysql表编码查看修改
    asp.net的mvc?
    连表
  • 原文地址:https://www.cnblogs.com/zhangyafei/p/10520410.html
Copyright © 2011-2022 走看看