zoukankan      html  css  js  c++  java
  • pandas 之 索引重塑

    import numpy as np 
    import pandas as pd
    

    There are a number of basic operations for rearanging tabular data. These are alternatingly referred to as reshape or pivot operations.

    多层索引重塑

    Hierarchical indexing provides a consistent way to rearrange data in a DataFrame. There are two primary actions:

    stack - 列拉长index
    ​ This "rotates" or pivots from the columns in the data to the rows.

    unstack
    ​ This pivots from the rows into the columns.

    I'll illustrate these operations through a series of examples. Consider a small DataFrame with string arrays as row and column indexes:

    data = pd.DataFrame(np.arange(6).reshape((2, 3)),
        index=pd.Index(['Ohio', 'Colorado'], name='state'),
        columns=pd.Index(['one', 'two', 'three'],
        name='number'))
    
    data
    
    number one two three
    state
    Ohio 0 1 2
    Colorado 3 4 5

    Using the stack method on this data pivots the columns into the rows, producing a Series.

    "stack 将每一行, 叠成一个Series, 堆起来"
    result = data.stack()
    
    result
    
    'stack 将每一行, 叠成一个Series, 堆起来'
    
    
    
    
    
    
    state     number
    Ohio      one       0
              two       1
              three     2
    Colorado  one       3
              two       4
              three     5
    dtype: int32
    

    From a hierarchically indexed Series, you can rearrage the data back into a DataFrame with unstack

    "unstack 将叠起来的Series, 变回DF"
    
    result.unstack()
    
    'unstack 将叠起来的Series, 变回DF'
    
    number one two three
    state
    Ohio 0 1 2
    Colorado 3 4 5

    By default the innermost level is unstacked(same with stack). You can unstack a different level by passing a level number or name.

    result.unstack(level=0)
    
    state Ohio Colorado
    number
    one 0 3
    two 1 4
    three 2 5
    result.unstack(level='state')
    
    state Ohio Colorado
    number
    one 0 3
    two 1 4
    three 2 5

    Unstacking might introduce missing data if all of the values in the level aren't found in each of the subgroups.

    s1 = pd.Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])
    
    s2 = pd.Series([4, 5, 6], index=['c', 'd', 'e'])
    
    data2 = pd.concat([s1, s2], keys=['one', 'two'])
    
    data2
    
    one  a    0
         b    1
         c    2
         d    3
    two  c    4
         d    5
         e    6
    dtype: int64
    
    data2.unstack()  # 外连接哦
    
    a b c d e
    one 0.0 1.0 2.0 3.0 NaN
    two NaN NaN 4.0 5.0 6.0
    %time data2.unstack().stack()
    
    Wall time: 5 ms
    
    
    
    
    
    one  a    0.0
         b    1.0
         c    2.0
         d    3.0
    two  c    4.0
         d    5.0
         e    6.0
    dtype: float64
    
    %time data2.unstack().stack(dropna=False)
    
    Wall time: 3 ms
    
    
    
    
    
    one  a    0.0
         b    1.0
         c    2.0
         d    3.0
         e    NaN
    two  a    NaN
         b    NaN
         c    4.0
         d    5.0
         e    6.0
    dtype: float64
    

    When you unstack in a DataFrame, the level unstacked becomes the lowest level in the result:

    df = pd.DataFrame({'left': result, 'right': result + 5},
    columns=pd.Index(['left', 'right'], name='side'))
    
    df
    
    side left right
    state number
    Ohio one 0 5
    two 1 6
    three 2 7
    Colorado one 3 8
    two 4 9
    three 5 10
    df.unstack("state")
    
    side left right
    state Ohio Colorado Ohio Colorado
    number
    one 0 3 5 8
    two 1 4 6 9
    three 2 5 7 10

    When calling stack, we can indicate the name of the axis to stack:

    %time df.unstack('state').stack('side')
    
    Wall time: 118 ms
    
    state Colorado Ohio
    number side
    one left 3 0
    right 8 5
    two left 4 1
    right 9 6
    three left 5 2
    right 10 7

    长转宽形

    A common way to store multiple time series in databases and CSV is in so-called long or stacked format. Let's load some example data and do a small amonut of time series wrangling and other data cleaning:

    %%time
    
    data = pd.read_csv("../examples/macrodata.csv")
    
    data.info()
    
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 203 entries, 0 to 202
    Data columns (total 14 columns):
    year        203 non-null float64
    quarter     203 non-null float64
    realgdp     203 non-null float64
    realcons    203 non-null float64
    realinv     203 non-null float64
    realgovt    203 non-null float64
    realdpi     203 non-null float64
    cpi         203 non-null float64
    m1          203 non-null float64
    tbilrate    203 non-null float64
    unemp       203 non-null float64
    pop         203 non-null float64
    infl        203 non-null float64
    realint     203 non-null float64
    dtypes: float64(14)
    memory usage: 22.3 KB
    Wall time: 142 ms
    
    data.head()
    
    year quarter realgdp realcons realinv realgovt realdpi cpi m1 tbilrate unemp pop infl realint
    0 1959.0 1.0 2710.349 1707.4 286.898 470.045 1886.9 28.98 139.7 2.82 5.8 177.146 0.00 0.00
    1 1959.0 2.0 2778.801 1733.7 310.859 481.301 1919.7 29.15 141.7 3.08 5.1 177.830 2.34 0.74
    2 1959.0 3.0 2775.488 1751.8 289.226 491.260 1916.4 29.35 140.5 3.82 5.3 178.657 2.74 1.09
    3 1959.0 4.0 2785.204 1753.7 299.356 484.052 1931.3 29.37 140.0 4.33 5.6 179.386 0.27 4.06
    4 1960.0 1.0 2847.699 1770.5 331.722 462.199 1955.5 29.54 139.6 3.50 5.2 180.007 2.31 1.19
    periods = pd.PeriodIndex(year=data.year, quarter=data.quarter, name='date')
    
    columns = pd.Index(['realgdp', 'infl', 'unemp'], name='item')
    
    # 修改列索引名
    data = data.reindex(columns=columns)
    
    data.index = periods.to_timestamp('D', 'end')
    
    ldata = data.stack().reset_index().rename(columns={0:'value'})
    
    
    ldata[:10]
    
    date item value
    0 1959-03-31 realgdp 2710.349
    1 1959-03-31 infl 0.000
    2 1959-03-31 unemp 5.800
    3 1959-06-30 realgdp 2778.801
    4 1959-06-30 infl 2.340
    5 1959-06-30 unemp 5.100
    6 1959-09-30 realgdp 2775.488
    7 1959-09-30 infl 2.740
    8 1959-09-30 unemp 5.300
    9 1959-12-31 realgdp 2785.204

    This is so-called long format for multiple time series, or other observational data with two or more keys. Each row in the table represents a single observation.

    Data is frequently stored this way in relational databases like MySQL, as a fixed schema allows the number of distinct values in the item columns to change as data is added to the table. In the previous example, date and keys offering both relational integrity and easier joins. In some cases, the data may be more difficult to work with in this format; you might prefer to have a DataFrame containing one column per distinct item value indexed by timestamps in the date column. DataFrame's pivot method performs exactly this transformation:

    pivoted = ldata.pivot('date', 'item', 'value')
    
    pivoted[:5]
    
    item infl realgdp unemp
    date
    1959-03-31 0.00 2710.349 5.8
    1959-06-30 2.34 2778.801 5.1
    1959-09-30 2.74 2775.488 5.3
    1959-12-31 0.27 2785.204 5.6
    1960-03-31 2.31 2847.699 5.2

    The first two values passed are the columns to be used respectively as the row and column index, then finally an optional value column to fill the DataFrame. Suppose you had two value columns that you wanted to reshape simultaneously:

    ldata['valu2'] = np.random.randn(len(ldata))
    
    ldata[:10]
    
    date item value valu2
    0 1959-03-31 realgdp 2710.349 -0.143460
    1 1959-03-31 infl 0.000 -0.422318
    2 1959-03-31 unemp 5.800 0.389872
    3 1959-06-30 realgdp 2778.801 -0.208526
    4 1959-06-30 infl 2.340 -1.538956
    5 1959-06-30 unemp 5.100 -0.143273
    6 1959-09-30 realgdp 2775.488 0.385763
    7 1959-09-30 infl 2.740 0.564365
    8 1959-09-30 unemp 5.300 0.266295
    9 1959-12-31 realgdp 2785.204 -1.267871

    By omitting the last argument, you obtain a DataFrame with hierarchical columns:

    Wide to Long

    An inverse operation to pivot for DataFrame is pandas.melt. Rather than transroming one columns into many in a new DataFrame, it merges multiple columns into one, producing a DataFrame that is longer than the input, Let's look at an example:

    df = pd.DataFrame({
        'key': ['foo', 'bar', 'baz'],
        'A':[1,2,3],
        'B':[4,5,6],
        'C':[7,8,9]
    })
    
    df
    
    key A B C
    0 foo 1 4 7
    1 bar 2 5 8
    2 baz 3 6 9

    The 'key' columns may be a group indicator, and the other columns are data values. When using pandas.melt, we must indicate which colmuns are group indicators Let's use 'key' as the only group indicator here:

    melted = pd.melt(df, ['key'])
    
    melted
    
    key variable value
    0 foo A 1
    1 bar A 2
    2 baz A 3
    3 foo B 4
    4 bar B 5
    5 baz B 6
    6 foo C 7
    7 bar C 8
    8 baz C 9

    Using pivot, we can reshape back to the original layout:(布局)

    reshaped = melted.pivot('key', 'variable', 'value')
    
    reshaped
    
    variable A B C
    key
    bar 2 5 8
    baz 3 6 9
    foo 1 4 7

    Since the result of pivot creats an index from the column used as the row labels, we may want to use reset_index to move the data back into a column:

    reshaped.reset_index()
    
    variable key A B C
    0 bar 2 5 8
    1 baz 3 6 9
    2 foo 1 4 7

    You can also specify a subset of columns to use as value columns:

    pd.melt(df, id_vars=['key'], value_vars=['A', 'B'])
    
    key variable value
    0 foo A 1
    1 bar A 2
    2 baz A 3
    3 foo B 4
    4 bar B 5
    5 baz B 6

    pandas.melt can be used without any group identifiers, too:

    pd.melt(df, value_vars=['A', 'B', 'C'])
    
    variable value
    0 A 1
    1 A 2
    2 A 3
    3 B 4
    4 B 5
    5 B 6
    6 C 7
    7 C 8
    8 C 9
    pd.melt(df, value_vars=['key', 'A', 'B'])
    
    variable value
    0 key foo
    1 key bar
    2 key baz
    3 A 1
    4 A 2
    5 A 3
    6 B 4
    7 B 5
    8 B 6

    小结

    Now that you have some pandas basics for data import, clearning, and reorganization under your belt, we are ready to move on to data visualization with matplotlib. We will return to pandas later in the book when we discuss more advance analytics.

  • 相关阅读:
    linux查看cpu、内存信息
    PHP之路,Day1
    Zabbix3.0完整部署
    linux时间同步
    nginx日志切割脚本
    Rsync+sersync文件实时同步
    阿里云自动挂载云盘脚本
    nginx不支持pathinfo 导致thinkphp出错解决办法
    VIM选项配置说明
    vagrant 本地添加box 支持带版本号
  • 原文地址:https://www.cnblogs.com/chenjieyouge/p/11945169.html
Copyright © 2011-2022 走看看