  • pandas: Cross-Tabulations and Pivot Tables

    import numpy as np 
    import pandas as pd 
    

    Overview

    A pivot table is a data summarization tool frequently found in spreadsheet programs and other data analysis software. It aggregates a table of data by one or more keys, arranging the data in a rectangle with some of the group keys along the rows and some along the columns.
    Pivot tables in Python with pandas are made possible through the groupby facility described in this chapter, combined with reshape operations that use hierarchical indexing.
    DataFrame has a pivot_table method, and there is also a top-level pandas.pivot_table function. In addition to providing a convenience interface to groupby, pivot_table can add partial totals, also known as margins.

    Returning to the tipping dataset, suppose you wanted to compute a table of group means (the default pivot_table aggregation type) arranged by day and smoker on the rows, that is, the mean within each group:

    tips = pd.read_csv('../examples/tips.csv')

    # Add a new column, tip_pct (the tip as a fraction of the total bill)
    tips['tip_pct'] = tips['tip'] / tips['total_bill']

    tips[:6]
    
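       total_bill   tip  smoker  day    time  size   tip_pct
    0       16.99  1.01      No  Sun  Dinner     2  0.059447
    1       10.34  1.66      No  Sun  Dinner     3  0.160542
    2       21.01  3.50      No  Sun  Dinner     3  0.166587
    3       23.68  3.31      No  Sun  Dinner     2  0.139780
    4       24.59  3.61      No  Sun  Dinner     4  0.146808
    5       25.29  4.71      No  Sun  Dinner     4  0.186240

    # The default aggregation is mean. The numeric columns are listed
    # explicitly because recent pandas versions may raise an error when
    # the default would try to average a non-numeric column such as time.
    tips.pivot_table(index=['day', 'smoker'],
                     values=['size', 'tip', 'tip_pct', 'total_bill'])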
    
                     size       tip   tip_pct  total_bill
    day  smoker
    Fri  No      2.250000  2.812500  0.151650   18.420000
         Yes     2.066667  2.714000  0.174783   16.813333
    Sat  No      2.555556  3.102889  0.158048   19.661778
         Yes     2.476190  2.875476  0.147906   21.276667
    Sun  No      2.929825  3.167895  0.160113   20.506667
         Yes     2.578947  3.516842  0.187250   24.120000
    Thur No      2.488889  2.673778  0.160298   17.113111
         Yes     2.352941  3.030000  0.163863   19.190588

    This could have been produced with groupby directly. Now, suppose we want to aggregate only tip_pct and size, and additionally group by time. I'll put smoker in the table columns and time and day in the rows:

    tips.pivot_table(['tip_pct', 'size'], index=['time', 'day'],
                     columns='smoker')
    
                     size             tip_pct
    smoker             No       Yes        No       Yes
    time   day
    Dinner Fri   2.000000  2.222222  0.139622  0.165347
           Sat   2.555556  2.476190  0.158048  0.147906
           Sun   2.929825  2.578947  0.160113  0.187250
           Thur  2.000000       NaN  0.159744       NaN
    Lunch  Fri   3.000000  1.833333  0.187735  0.188937
           Thur  2.500000  2.352941  0.160311  0.163863
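
    To make the link to groupby concrete, here is a minimal sketch showing that a pivot table like the one above can be rebuilt by grouping, averaging, and then unstacking the column key. The tips_small frame is a made-up miniature, not the real tips data, and its name is an assumption for this example only.

```python
import pandas as pd

# Hypothetical miniature of the tips data (values invented for illustration).
tips_small = pd.DataFrame({
    "day":     ["Fri", "Fri", "Sat", "Sat", "Sat", "Sun"],
    "smoker":  ["No", "Yes", "No", "Yes", "Yes", "No"],
    "size":    [2, 3, 2, 4, 3, 5],
    "tip_pct": [0.10, 0.20, 0.15, 0.25, 0.05, 0.30],
})

# pivot_table: day on the rows, smoker on the columns, mean by default.
pivoted = tips_small.pivot_table(["tip_pct", "size"], index="day", columns="smoker")

# The same table built by hand: group on both keys, take the mean,
# then move the smoker level from the rows to the columns.
by_hand = (
    tips_small.groupby(["day", "smoker"])[["size", "tip_pct"]]
    .mean()
    .unstack("smoker")
)

print(pivoted)
print(by_hand)
```

    Both frames have size and tip_pct as the outer column level and smoker as the inner one; the Sun/Yes cells are NaN because that combination does not occur in the sample.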

    We could augment this table to include partial totals by passing margins=True. This has the effect of adding All row and column labels, with corresponding values being the group statistics for all the data within a single tier:

    tips.pivot_table(['tip_pct', 'size'], index=['time', 'day'],
                     columns='smoker', margins=True)

                     size                       tip_pct
    smoker             No       Yes       All        No       Yes       All
    time   day
    Dinner Fri   2.000000  2.222222  2.166667  0.139622  0.165347  0.158916
           Sat   2.555556  2.476190  2.517241  0.158048  0.147906  0.153152
           Sun   2.929825  2.578947  2.842105  0.160113  0.187250  0.166897
           Thur  2.000000       NaN  2.000000  0.159744       NaN  0.159744
    Lunch  Fri   3.000000  1.833333  2.000000  0.187735  0.188937  0.188765
           Thur  2.500000  2.352941  2.459016  0.160311  0.163863  0.161301
    All          2.668874  2.408602  2.569672  0.159328  0.163196  0.160803

    Here, the All values are means computed without regard to smoker versus non-smoker (the All columns) or to either of the two levels of grouping on the rows (the All row).
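
    As a small check of what the margins contain, the sketch below compares the All entries against means computed directly from the raw rows. The df frame is invented for this example; only pivot_table and margins=True come from the text above.

```python
import pandas as pd

# Hypothetical data, made up for illustration.
df = pd.DataFrame({
    "day":     ["Fri", "Fri", "Sat", "Sat", "Sat", "Sun"],
    "smoker":  ["No", "Yes", "No", "Yes", "Yes", "No"],
    "tip_pct": [0.10, 0.20, 0.15, 0.25, 0.05, 0.30],
})

table = df.pivot_table("tip_pct", index="day", columns="smoker", margins=True)

# Row margin: mean of every Friday row, ignoring smoker.
fri_all = table.loc["Fri", "All"]
# Column margin: mean of every smoker row, ignoring day.
yes_all = table.loc["All", "Yes"]
# Grand total: mean of the whole tip_pct column.
grand = table.loc["All", "All"]

# The margin is computed from the raw rows, so it can differ from
# the plain average of the per-day cell means in the same column.
mean_of_cells = table["Yes"].drop("All").mean()

print(table)
print(yes_all, mean_of_cells)
```

    In this sample the Yes margin and the average of the Yes cells differ because Saturday contributes two smoker rows but only one cell.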

    To use a different aggregation function, pass it to aggfunc. For example, count or len will give you a cross-tabulation of group sizes:

    tips.pivot_table('tip_pct', index=['time', 'smoker'],
                     columns='day', aggfunc=len, margins=True)

    day              Fri    Sat    Sun   Thur    All
    time   smoker
    Dinner No        3.0   45.0   57.0    1.0  106.0
           Yes       9.0   42.0   19.0    NaN   70.0
    Lunch  No        1.0    NaN    NaN   44.0   45.0
           Yes       6.0    NaN    NaN   17.0   23.0
    All             19.0   87.0   76.0   62.0  244.0

    If some combinations are empty (or otherwise NA), you may wish to pass a fill_value:

    tips.pivot_table('tip_pct', index=['time', 'size', 'smoker'],
                     columns='day', aggfunc='mean', fill_value=0)

    day                      Fri       Sat       Sun      Thur
    time   size smoker
    Dinner 1    No      0.000000  0.137931  0.000000  0.000000
                Yes     0.000000  0.325733  0.000000  0.000000
           2    No      0.139622  0.162705  0.168859  0.159744
                Yes     0.171297  0.148668  0.207893  0.000000
           3    No      0.000000  0.154661  0.152663  0.000000
                Yes     0.000000  0.144995  0.152660  0.000000
           4    No      0.000000  0.150096  0.148143  0.000000
                Yes     0.117750  0.124515  0.193370  0.000000
           5    No      0.000000  0.000000  0.206928  0.000000
                Yes     0.000000  0.106572  0.065660  0.000000
           6    No      0.000000  0.000000  0.103799  0.000000
    Lunch  1    No      0.000000  0.000000  0.000000  0.181728
                Yes     0.223776  0.000000  0.000000  0.000000
           2    No      0.000000  0.000000  0.000000  0.166005
                Yes     0.181969  0.000000  0.000000  0.158843
           3    No      0.187735  0.000000  0.000000  0.084246
                Yes     0.000000  0.000000  0.000000  0.204952
           4    No      0.000000  0.000000  0.000000  0.138919
                Yes     0.000000  0.000000  0.000000  0.155410
           5    No      0.000000  0.000000  0.000000  0.121389
           6    No      0.000000  0.000000  0.000000  0.173706
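
    The effect of fill_value is easiest to see side by side. The following sketch uses an invented frame, df, in which the Sun/Yes combination never occurs, and builds the same pivot table with and without fill_value=0.

```python
import pandas as pd

# Hypothetical data: there is no ("Sun", "Yes") combination.
df = pd.DataFrame({
    "day":     ["Fri", "Fri", "Sat", "Sat", "Sun"],
    "smoker":  ["No", "Yes", "No", "Yes", "No"],
    "tip_pct": [0.10, 0.20, 0.15, 0.25, 0.30],
})

with_nan = df.pivot_table("tip_pct", index="day", columns="smoker")
with_zero = df.pivot_table("tip_pct", index="day", columns="smoker", fill_value=0)

print(with_nan)   # the empty Sun/Yes cell shows up as NaN
print(with_zero)  # the same cell is replaced by 0
```

    Note that after filling, a cell with no observations looks the same as a group whose mean really is 0, so fill_value=0 is a presentation choice rather than a statistical one.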

    See Table 10-2 for a summary of pivot_table options.

    Argument    Description
    values      Column name or names to aggregate; by default, aggregates all numeric columns
    index       Column names or other group keys to group on the rows of the resulting pivot table
    columns     Column names or other group keys to group on the columns of the resulting pivot table
    aggfunc     Aggregation function or list of functions (mean by default); can be any function valid in a groupby context
    fill_value  Replace missing values in the result table
    dropna      If True, do not include columns whose entries are all NA
    margins     Add row/column subtotals and grand total
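
    As an illustration of the aggfunc row in the table, here is a minimal sketch that passes a list of functions. The df frame is made up for the example; with a list, the function names become the outer level of the result's columns.

```python
import pandas as pd

# Hypothetical data, invented for illustration.
df = pd.DataFrame({
    "time":    ["Dinner", "Dinner", "Dinner", "Lunch", "Lunch"],
    "smoker":  ["No", "Yes", "No", "No", "Yes"],
    "tip_pct": [0.10, 0.20, 0.30, 0.15, 0.25],
})

# Two aggregations at once: the mean and the number of observations.
table = df.pivot_table(
    values="tip_pct",
    index="time",
    columns="smoker",
    aggfunc=["mean", "count"],
)

print(table)

# Individual cells are addressed with a (function, smoker) column key.
dinner_no_mean = table.loc["Dinner", ("mean", "No")]
dinner_no_count = table.loc["Dinner", ("count", "No")]
```

    Showing the count next to the mean is a cheap way to see how many rows stand behind each average.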

    Cross-Tabulations: crosstab

    • A cross-tabulation (or crosstab for short) is a special case of a pivot table that computes group frequencies; in effect, it is a pivot table whose aggregation function is a count.

    As part of some survey analysis, we might want to summarize survey data by nationality and handedness. Assuming data is a DataFrame with Nationality and Handedness columns, you could use pivot_table to do this, but the pandas.crosstab function can be more convenient:

    pd.crosstab(data.Nationality, data.Handedness, margins=True)
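
    The survey data itself is not shown here, so the sketch below builds a small stand-in. The column names follow the call above, but the rows are entirely invented and only serve to make the call runnable.

```python
import pandas as pd

# Hypothetical survey data; the rows are invented for this sketch.
data = pd.DataFrame({
    "Sample":      [1, 2, 3, 4, 5, 6],
    "Nationality": ["USA", "Japan", "USA", "Japan", "Japan", "USA"],
    "Handedness":  ["Right-handed", "Left-handed", "Right-handed",
                    "Right-handed", "Left-handed", "Right-handed"],
})

# Count how often each (Nationality, Handedness) pair occurs,
# with an extra All row and column holding the totals.
counts = pd.crosstab(data.Nationality, data.Handedness, margins=True)
print(counts)
```

    Combinations that never occur, such as USA with Left-handed in this sample, appear as 0 rather than NaN.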
    

    The first two arguments to crosstab can each be either an array, a Series, or a list of arrays, as in the tips data:

    # Count smokers and non-smokers for each combination of time and day
    pd.crosstab([tips.time, tips.day], tips.smoker, margins=True)
    
    smoker        No  Yes  All
    time   day
    Dinner Fri     3    9   12
           Sat    45   42   87
           Sun    57   19   76
           Thur    1    0    1
    Lunch  Fri     1    6    7
           Thur   44   17   61
    All          151   93  244
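
    To tie crosstab back to pivot_table, the sketch below builds the same frequency table both ways on an invented frame, df. It assumes the counted column (tip here) has no missing values, since the count aggregation skips NA entries while crosstab counts rows.

```python
import pandas as pd

# Hypothetical data, invented for illustration; tip has no missing values.
df = pd.DataFrame({
    "time":   ["Dinner", "Dinner", "Dinner", "Lunch", "Lunch"],
    "smoker": ["No", "Yes", "No", "No", "No"],
    "tip":    [1.0, 2.0, 3.0, 1.5, 2.5],
})

# Frequencies via crosstab: only the two key columns are needed.
ct = pd.crosstab(df.time, df.smoker)

# The same frequencies via pivot_table: count any fully populated column
# and fill the combinations that never occur with 0.
pt = df.pivot_table("tip", index="time", columns="smoker",
                    aggfunc="count", fill_value=0)

print(ct)
print(pt)
```

    The two tables hold the same numbers, although the dtype of the pivot_table result (integer or float) may depend on the pandas version.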

    Summary

    Mastering pandas's data grouping tools can help with both data cleaning and modeling or statistical analysis work, in understanding as well as in hands-on practice.

  • Original article: https://www.cnblogs.com/chenjieyouge/p/12031901.html