  • Pandas Study Notes

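    1. Data structures

    Pandas has three main data structures:

    • Series (one-dimensional, size-immutable)
    • DataFrame (two-dimensional, size-mutable)
    • Panel (three-dimensional, size-mutable; deprecated in pandas 0.20 and removed in 0.25)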

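    Series

    A one-dimensional array-like structure holding homogeneous data, for example the collection 1, 3, 5, 7, ...

    1 3 5 7 ...

    Key points

    • Homogeneous data
    • Size is immutable
    • Values are mutable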

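    DataFrame

    A two-dimensional structure holding heterogeneous data. For example:

    Name       Age  Gender
    Xiao Ming  20
    Xiao Hong  15
    Xiao Gang  18

    Key points

    • Heterogeneous data
    • Size is mutable
    • Data is mutable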

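    Panel

    A three-dimensional structure holding heterogeneous data; it can be thought of as a container of DataFrames.

    Key points

    • Heterogeneous data
    • Size is mutable
    • Data is mutable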

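    2. Series

    A Series is a one-dimensional labeled array that can hold any data type (integers, strings, floats, Python objects, and so on).

    Constructor

    pandas.Series(data, index, dtype, copy)

    Parameter  Description
    data       The data, in one of several forms: ndarray, list, constant (scalar)
    index      Index labels; they must be hashable and the same length as the data (they do not have to be unique). Defaults to np.arange(n) if no index is passed.
    dtype      The data type. If omitted, it is inferred from the data.
    copy       Whether to copy the data; defaults to False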

    Creating an empty Series

    import pandas as pd
    s = pd.Series()
    print(s)

    Output (pandas 2.0 and later report dtype: object for an empty Series)

    Series([], dtype: float64)

    If data is an ndarray, any index passed must have the same length. If no index is passed, the default index is 0 to n-1.

    import pandas as pd
    import numpy as np
    data = np.array(['a','b','c','d'])
    s = pd.Series(data)
    print(s)

    Output

    0    a
    1    b
    2    c
    3    d
    dtype: object
    
    import pandas as pd
    import numpy as np
    data = np.array(['a','b','c','d'])
    s = pd.Series(data,index=[100,101,102,103])   # custom index labels
    print(s)

    Output

    100    a
    101    b
    102    c
    103    d
    dtype: object

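    Creating a Series from a dict: if no index is specified, the dictionary keys are used as the index. If an index is specified, values are looked up by those labels, and labels missing from the dict get NaN.

    import pandas as pd
    import numpy as np
    data = {'a' : 0., 'b' : 1., 'c' : 2.}
    s = pd.Series(data)
    print(s)

    Output

    a    0.0
    b    1.0
    c    2.0
    dtype: float64

    import pandas as pd
    import numpy as np
    data = {'a' : 0., 'b' : 1., 'c' : 2.}
    s = pd.Series(data,index=['b','c','d','a'])   # 'd' is not a key, so it becomes NaN
    print(s)

    Output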

    b    1.0
    c    2.0
    d    NaN
    a    0.0
    dtype: float64

    Creating a Series from a scalar: if data is a scalar value, an index must be provided. The value is repeated to match the length of the index.

    import pandas as pd
    import numpy as np
    s = pd.Series(5, index=[0, 1, 2, 3])
    print(s)

    Output

    0    5
    1    5
    2    5
    3    5
    dtype: int64

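    Accessing data by position: data in a Series can be accessed much like data in an ndarray.

    import pandas as pd
    s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
    print(s)
    # first element by position; plain s[0] is deprecated on recent pandas when the index is not integer-based
    print(s.iloc[0])

    Output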

    a    1
    b    2
    c    3
    d    4
    e    5
    dtype: int64
    1
    
    import pandas as pd
    s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
    print(s[:3])    # first three elements

    Output

    a    1
    b    2
    c    3
    dtype: int64
    
    import pandas as pd
    s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
    print(s[-3:])   # last three elements

    Output

    c    3
    d    4
    e    5
    dtype: int64

    Retrieving data by label: values can be read and set through index labels.

    import pandas as pd
    s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
    print(s['a'])

    Output

    1
    
    import pandas as pd
    s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
    print(s[['a','c','d']])   # a list of labels returns a Series

    Output

    a    1
    c    3
    d    4
    dtype: int64

    If a label is not in the index, an exception (KeyError) is raised.
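A minimal sketch of the missing-label behaviour, reusing the five-element Series from the examples above. The variable names are illustrative; `Series.get` is shown as one way to supply a fallback instead of catching the exception.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

# Label-based lookup of a label that is not in the index raises KeyError.
try:
    s['f']
    missing_raises = False
except KeyError:
    missing_raises = True

# Series.get returns a default value instead of raising.
fallback = s.get('f', default=-1)

# Explicit positional (.iloc) and label-based (.loc) access.
first_by_position = s.iloc[0]
first_by_label = s.loc['a']

print(missing_raises, fallback, first_by_position, first_by_label)
```

Using `.loc` and `.iloc` makes it explicit whether a key is a label or a position, which avoids ambiguity when the index itself contains integers.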

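    3. DataFrame

    pandas.DataFrame(data, index, columns, dtype, copy)

    Constructor parameters:

    Parameter  Description
    data       The data, in one of several forms: ndarray, Series, map, list, dict, constant, or another DataFrame
    index      Row labels
    columns    Column labels
    dtype      Data type of each column
    copy       Whether to copy the data; defaults to False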

    Creating an empty DataFrame

    import pandas as pd
    df = pd.DataFrame()
    print(df)

    Output

    Empty DataFrame
    Columns: []
    Index: []

    Creating a DataFrame from a list

    import pandas as pd
    data = [1,2,3,4,5]
    df = pd.DataFrame(data)
    print(df)

    Output

       0
    0  1
    1  2
    2  3
    3  4
    4  5
    
    import pandas as pd
    data = [['Alex',10],['Bob',12],['Clarke',13]]
    df = pd.DataFrame(data,columns=['Name','Age'])   # a list of lists, one inner list per row
    print(df)

    Output

         Name  Age
    0    Alex   10
    1     Bob   12
    2  Clarke   13
    
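    import pandas as pd
    data = [['Alex',10],['Bob',12],['Clarke',13]]
    # The original passed dtype=float to the constructor; recent pandas versions may reject that
    # because the Name column cannot be converted, so only the Age column is converted here.
    df = pd.DataFrame(data,columns=['Name','Age']).astype({'Age': float})
    print(df)

    Output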

         Name   Age
    0    Alex  10.0
    1     Bob  12.0
    2  Clarke  13.0

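    Creating a DataFrame from a dict of ndarrays/lists: all the ndarrays must have the same length. If an index is passed, its length must equal the length of the arrays; otherwise the default index 0 to n-1 is used.

    import pandas as pd
    data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
    df = pd.DataFrame(data)
    print(df)

    Output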

        Name  Age
    0    Tom   28
    1   Jack   34
    2  Steve   29
    3  Ricky   42

    Creating a DataFrame with an explicit index passed as a list

    import pandas as pd
    data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
    df = pd.DataFrame(data, index=['rank1','rank2','rank3','rank4'])
    print(df)

    Output

            Name  Age
    rank1    Tom   28
    rank2   Jack   34
    rank3  Steve   29
    rank4  Ricky   42

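    Creating a DataFrame from a list of dicts: a list of dictionaries can be passed as input data, and the dictionary keys become the column names by default. Keys missing from a given dict produce NaN in that row.

    import pandas as pd
    data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
    df = pd.DataFrame(data)
    print(df)

    Output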

       a   b     c
    0  1   2   NaN
    1  5  10  20.0

    Creating a DataFrame from a list of dicts with explicit row and column indexes. A column name that is not a key in the dicts (b1 below) is filled with NaN.

    import pandas as pd
    data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
    df1 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b'])
    df2 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b1'])
    print(df1)
    print(df2)

    Output

            a   b
    first   1   2
    second  5  10
            a  b1
    first   1 NaN
    second  5 NaN

    A dict of Series can be passed to form a DataFrame; the resulting index is the union of all the Series indexes.

    import pandas as pd
    d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
         'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
    df = pd.DataFrame(d)
    print(df)

    Output

       one  two
    a  1.0    1
    b  2.0    2
    c  3.0    3
    d  NaN    4

    Column selection

    import pandas as pd
    d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
         'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
    df = pd.DataFrame(d)
    print(df['one'])

    Output

    a    1.0
    b    2.0
    c    3.0
    d    NaN
    Name: one, dtype: float64

    Column addition

    import pandas as pd
    d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
         'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
    df = pd.DataFrame(d)
    print("Adding a new column by passing as Series:")
    df['three']=pd.Series([10,20,30],index=['a','b','c'])   # aligned on the index; 'd' becomes NaN
    print(df)
    print("Adding a new column using the existing columns in DataFrame:")
    df['four']=df['one']+df['three']
    print(df)

    Output

    Adding a new column by passing as Series:
       one  two  three
    a  1.0    1   10.0
    b  2.0    2   20.0
    c  3.0    3   30.0
    d  NaN    4    NaN
    Adding a new column using the existing columns in DataFrame:
       one  two  three  four
    a  1.0    1   10.0  11.0
    b  2.0    2   20.0  22.0
    c  3.0    3   30.0  33.0
    d  NaN    4    NaN   NaN

    Column deletion

    import pandas as pd
    d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
         'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']),
         'three' : pd.Series([10,20,30], index=['a','b','c'])}
    df = pd.DataFrame(d)
    print("Our dataframe is:")
    print(df)
    print("Deleting the first column using DEL function:")
    del df['one']      # removes the column in place
    print(df)
    print("Deleting another column using POP function:")
    df.pop('two')      # removes the column in place and returns it as a Series
    print(df)

    Output

    Our dataframe is:
       one  two  three
    a  1.0    1   10.0
    b  2.0    2   20.0
    c  3.0    3   30.0
    d  NaN    4    NaN
    Deleting the first column using DEL function:
       two  three
    a    1   10.0
    b    2   20.0
    c    3   30.0
    d    4    NaN
    Deleting another column using POP function:
       three
    a   10.0
    b   20.0
    c   30.0
    d    NaN

    Row selection, addition, and deletion

    import pandas as pd
    d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
         'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
    df = pd.DataFrame(d)
    print(df)
    print('---------')
    print(df.loc['a'])      # select a row by label
    print('---------')
    print(df.iloc[2])       # select a row by integer position
    print('---------')
    print(df[2:4])          # slice rows by position
    print('---------')
    df2=pd.DataFrame([[5,6],[7,8]],index=['e','f'],columns=['one','two'])
    # DataFrame.append was removed in pandas 2.0; pd.concat adds the rows instead
    df=pd.concat([df,df2])
    print(df)
    df=df.drop('a')         # delete the row labelled 'a'
    print('---------')
    print(df)

    Output

       one  two
    a  1.0    1
    b  2.0    2
    c  3.0    3
    d  NaN    4
    ---------
    one    1.0
    two    1.0
    Name: a, dtype: float64
    ---------
    one    3.0
    two    3.0
    Name: c, dtype: float64
    ---------
       one  two
    c  3.0    3
    d  NaN    4
    ---------
       one  two
    a  1.0    1
    b  2.0    2
    c  3.0    3
    d  NaN    4
    e  5.0    6
    f  7.0    8
    ---------
       one  two
    b  2.0    2
    c  3.0    3
    d  NaN    4
    e  5.0    6
    f  7.0    8
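The row operations above can also be sketched with `pd.concat`, which is the usual replacement for `DataFrame.append` (removed in pandas 2.0). The frames here are small illustrative stand-ins, not the exact data from the example.

```python
import pandas as pd

df = pd.DataFrame({'one': [1.0, 2.0], 'two': [1, 2]}, index=['a', 'b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], index=['e', 'f'], columns=['one', 'two'])

# pd.concat stacks the rows of both frames and keeps their index labels.
combined = pd.concat([df, df2])

# ignore_index=True discards the labels and renumbers the rows from 0.
renumbered = pd.concat([df, df2], ignore_index=True)

# Assigning to a new label through .loc appends a single row in place.
combined.loc['g'] = [9, 10]

# drop returns a new frame without the given row label.
trimmed = combined.drop('a')

print(combined)
print(renumbered)
print(trimmed)
```

Note that `drop` and `pd.concat` return new objects, while assignment through `.loc` modifies the frame in place.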
    

    4. Panel

    Panel was deprecated in pandas 0.20 and removed in 0.25, so the code in this section only runs on older pandas versions.

    pandas.Panel(data, items, major_axis, minor_axis, dtype, copy)

    Parameter   Description
    data        The data, in one of several forms: ndarray, Series, map, list, dict, constant, or DataFrame
    items       axis=0
    major_axis  axis=1
    minor_axis  axis=2
    dtype       Data type of each column
    copy        Whether to copy the data

    Creating a Panel and selecting data

    import pandas as pd
    import numpy as np
    print('--------creat an empty panel---------')
    p=pd.Panel()
    print(p)
    print('-------------end---------------------')
    print('---creat an panel from 3D ndarray----')
    data = np.random.rand(2,4,5)
    p = pd.Panel(data)
    print(p)
    print('-------------end---------------------')
    print('-creat an panel from dict(DataFrame)-')
    data = {'Item1' : pd.DataFrame(np.random.randn(4, 3)),
            'Item2' : pd.DataFrame(np.random.randn(4, 2))}
    p = pd.Panel(data)
    print(p)
    print('-------------end---------------------')
    print('-------select data from panel--------')
    data = {'Item1' : pd.DataFrame(np.random.randn(4, 3)),
            'Item2' : pd.DataFrame(np.random.randn(4, 2))}
    p = pd.Panel(data)
    print(p['Item1'])        # select by item
    print('-------------end---------------------')
    print('-----select data use major_axis------')
    data = {'Item1' : pd.DataFrame(np.random.randn(4, 3)),
            'Item2' : pd.DataFrame(np.random.randn(4, 2))}
    p = pd.Panel(data)
    print(p.major_xs(1))     # cross-section along the major axis
    print('-------------end---------------------')
    print('-----select data use minor_axis------')
    data = {'Item1' : pd.DataFrame(np.random.randn(4, 3)),
            'Item2' : pd.DataFrame(np.random.randn(4, 2))}
    p = pd.Panel(data)
    print(p.minor_xs(1))     # cross-section along the minor axis
    print('-------------end---------------------')

    Output

    --------creat an empty panel---------
    <class 'pandas.core.panel.Panel'>
    Dimensions: 0 (items) x 0 (major_axis) x 0 (minor_axis)
    Items axis: None
    Major_axis axis: None
    Minor_axis axis: None
    -------------end---------------------
    ---creat an panel from 3D ndarray----
    <class 'pandas.core.panel.Panel'>
    Dimensions: 2 (items) x 4 (major_axis) x 5 (minor_axis)
    Items axis: 0 to 1
    Major_axis axis: 0 to 3
    Minor_axis axis: 0 to 4
    -------------end---------------------
    -creat an panel from dict(DataFrame)-
    <class 'pandas.core.panel.Panel'>
    Dimensions: 2 (items) x 4 (major_axis) x 3 (minor_axis)
    Items axis: Item1 to Item2
    Major_axis axis: 0 to 3
    Minor_axis axis: 0 to 2
    -------------end---------------------
    -------select data from panel--------
              0         1         2
    0 -0.960065 -1.114559 -0.296025
    1 -0.382277 -0.585262  1.503437
    2  1.315953 -0.350967 -0.711729
    3  0.959712  0.800819 -0.673261
    -------------end---------------------
    -----select data use major_axis------
          Item1     Item2
    0 -1.742578 -0.697723
    1 -0.156266  0.003577
    2  0.023405       NaN
    -------------end---------------------
    -----select data use minor_axis------
          Item1     Item2
    0  1.103015  0.488929
    1 -0.391214 -0.030208
    2  1.783799  0.039654
    3 -1.863803 -0.949056
    -------------end---------------------
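Since Panel is no longer available in current pandas, one common substitute is a DataFrame with a MultiIndex. The following is a sketch under that assumption, not the author's method: it stacks the same kind of dict of DataFrames with `pd.concat` and shows rough equivalents of the three selections above. The level names `item` and `row` are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
frames = {
    'Item1': pd.DataFrame(rng.standard_normal((4, 3))),
    'Item2': pd.DataFrame(rng.standard_normal((4, 2))),
}

# Passing a dict to pd.concat uses its keys as the outer index level.
stacked = pd.concat(frames, names=['item', 'row'])

item1 = stacked.loc['Item1']            # roughly p['Item1']
row1 = stacked.xs(1, level='row')       # roughly p.major_xs(1), transposed
col1 = stacked[1].unstack('item')       # roughly p.minor_xs(1)

print(stacked.shape, item1.shape, row1.shape, col1.shape)
```

Because Item2 has only two columns, its third column is filled with NaN in the stacked frame, much like the NaN that appears in the `major_xs` output above.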

    5. Basic functionality

    Series basic functionality

    Attribute or method  Description
    axes                 Returns a list of the row axis labels.
    dtype                Returns the data type of the object.
    empty                Returns True if the Series is empty.
    ndim                 Returns the number of dimensions of the underlying data, which is 1 by definition.
    size                 Returns the number of elements in the underlying data.
    values               Returns the Series as an ndarray.
    head(n)              Returns the first n rows.
    tail(n)              Returns the last n rows.

    import pandas as pd
    import numpy as np
    s = pd.Series(np.random.randn(4))
    print(s)
    print('-------------')
    print("The axes are:")
    print(s.axes)
    print('-------------')
    print("Is the Object empty?")
    print(s.empty)
    print('-------------')
    print("The dimensions of the object:")
    print(s.ndim)
    print('-------------')
    print("The size of the object:")
    print(s.size)
    print('-------------')
    print("The actual data series is:")
    print(s.values)
    print('-------------')
    print("The first two rows of the data series:")
    print(s.head(2))
    print('-------------')
    print("The last two rows of the data series:")
    print(s.tail(2))

    Output

    0   -1.478084
    1    0.468882
    2    0.394107
    3    0.682990
    dtype: float64
    -------------
    The axes are:
    [RangeIndex(start=0, stop=4, step=1)]
    -------------
    Is the Object empty?
    False
    -------------
    The dimensions of the object:
    1
    -------------
    The size of the object:
    4
    -------------
    The actual data series is:
    [-1.47808355  0.46888222  0.3941075   0.68299036]
    -------------
    The first two rows of the data series:
    0   -1.478084
    1    0.468882
    dtype: float64
    -------------
    The last two rows of the data series:
    2    0.394107
    3    0.682990
    dtype: float64

    DataFrame basic functionality

    Attribute or method  Description
    T                    Transposes rows and columns.
    axes                 Returns a list whose only members are the row axis labels and the column axis labels.
    dtypes               Returns the data types in this object.
    empty                Returns True if the DataFrame is empty.
    ndim                 Number of axes (array dimensions).
    shape                Returns a tuple representing the dimensions of the DataFrame.
    size                 Number of elements in the DataFrame.
    values               Returns the data as an ndarray.
    head(n)              Returns the first n rows.
    tail(n)              Returns the last n rows.

    print('---------creat a DataFrame----------')
    import pandas as pd
    import numpy as np
    d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Minsu','Jack']),
         'Age':pd.Series([25,26,25,23,30,29,23]),
         'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}
    df = pd.DataFrame(d)
    print("Our data series is:")
    print(df)
    print('----------------end-----------------')
    print('--the transpose of the data series--')
    print(df.T)
    print('----------------end-----------------')
    print('-----row and column axis labels-----')
    print(df.axes)
    print('----------------end-----------------')
    print('---the data types of each column----')
    print(df.dtypes)
    print('----------------end-----------------')
    print('---------is the object empty--------')
    print(df.empty)
    print('----------------end-----------------')
    print('-----------the dimension------------')
    print(df.ndim)
    print('----------------end-----------------')
    print('--------------the shape-------------')
    print(df.shape)
    print('----------------end-----------------')
    print('--------------the shape-------------')
    print(df.shape)
    print('----------------end-----------------')
    print('------total number of elements------')
    print(df.size)
    print('----------------end-----------------')
    print('-------------actual data------------')
    print(df.values)
    print('----------------end-----------------')
    print('-------first two rows of data-------')
    print(df.head(2))
    print('----------------end-----------------')
    print('--------last two rows of data-------')
    print(df.tail(2))
    print('----------------end-----------------')

    Output

    ---------creat a DataFrame----------
    Our data series is:
        Name  Age  Rating
    0    Tom   25    4.23
    1  James   26    3.24
    2  Ricky   25    3.98
    3    Vin   23    2.56
    4  Steve   30    3.20
    5  Minsu   29    4.60
    6   Jack   23    3.80
    ----------------end-----------------
    --the transpose of the data series--
               0      1      2     3      4      5     6
    Name     Tom  James  Ricky   Vin  Steve  Minsu  Jack
    Age       25     26     25    23     30     29    23
    Rating  4.23   3.24   3.98  2.56    3.2    4.6   3.8
    ----------------end-----------------
    -----row and column axis labels-----
    [RangeIndex(start=0, stop=7, step=1), Index(['Name', 'Age', 'Rating'], dtype='object')]
    ----------------end-----------------
    ---the data types of each column----
    Name       object
    Age         int64
    Rating    float64
    dtype: object
    ----------------end-----------------
    ---------is the object empty--------
    False
    ----------------end-----------------
    -----------the dimension------------
    2
    ----------------end-----------------
    --------------the shape-------------
    (7, 3)
    ----------------end-----------------
    --------------the shape-------------
    (7, 3)
    ----------------end-----------------
    ------total number of elements------
    21
    ----------------end-----------------
    -------------actual data------------
    [['Tom' 25 4.23]
     ['James' 26 3.24]
     ['Ricky' 25 3.98]
     ['Vin' 23 2.56]
     ['Steve' 30 3.2]
     ['Minsu' 29 4.6]
     ['Jack' 23 3.8]]
    ----------------end-----------------
    -------first two rows of data-------
        Name  Age  Rating
    0    Tom   25    4.23
    1  James   26    3.24
    ----------------end-----------------
    --------last two rows of data-------
        Name  Age  Rating
    5  Minsu   29     4.6
    6   Jack   23     3.8
    ----------------end-----------------

    6. Descriptive statistics

    Function    Description
    sum()       Sum of the values along the requested axis; axis=0 by default
    mean()      Mean of the values
    std()       Standard deviation of the values
    median()    Median of the values
    mode()      Mode (most frequent value)
    min()       Minimum value
    max()       Maximum value
    abs()       Absolute value
    prod()      Product of the values
    cumsum()    Cumulative sum
    cumprod()   Cumulative product
    describe()  Summary statistics; its include argument selects the columns: 'object' summarizes string columns, 'number' summarizes numeric columns, 'all' summarizes every column

    print('--------creat a DataFrame---------')
    import pandas as pd
    import numpy as np
    d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Minsu','Jack',
                           'Lee','David','Gasper','Betina','Andres']),
         'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
         'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}
    df = pd.DataFrame(d)
    print(df)
    print('---------------end----------------')
    print('---------------sum----------------')
    print(df.sum())
    print('---------------end----------------')
    # numeric_only=True skips the Name column; recent pandas no longer drops it silently
    print(df.sum(axis=1, numeric_only=True))
    print('---------------end----------------')
    print('--------------mean----------------')
    print(df.mean(numeric_only=True))
    print('---------------end----------------')
    print('--------------std----------------')
    print(df.std(numeric_only=True))
    print('---------------end----------------')
    print('------------describe--------------')
    print(df.describe())
    print('---------------end----------------')

    Output

    --------creat a DataFrame---------
          Name  Age  Rating
    0      Tom   25    4.23
    1    James   26    3.24
    2    Ricky   25    3.98
    3      Vin   23    2.56
    4    Steve   30    3.20
    5    Minsu   29    4.60
    6     Jack   23    3.80
    7      Lee   34    3.78
    8    David   40    2.98
    9   Gasper   30    4.80
    10  Betina   51    4.10
    11  Andres   46    3.65
    ---------------end----------------
    ---------------sum----------------
    Name      TomJamesRickyVinSteveMinsuJackLeeDavidGasperBe...
    Age                                                     382
    Rating                                                44.92
    dtype: object
    ---------------end----------------
    0     29.23
    1     29.24
    2     28.98
    3     25.56
    4     33.20
    5     33.60
    6     26.80
    7     37.78
    8     42.98
    9     34.80
    10    55.10
    11    49.65
    dtype: float64
    ---------------end----------------
    --------------mean----------------
    Age       31.833333
    Rating     3.743333
    dtype: float64
    ---------------end----------------
    --------------std----------------
    Age       9.232682
    Rating    0.661628
    dtype: float64
    ---------------end----------------
    ------------describe--------------
                 Age     Rating
    count  12.000000  12.000000
    mean   31.833333   3.743333
    std     9.232682   0.661628
    min    23.000000   2.560000
    25%    25.000000   3.230000
    50%    29.500000   3.790000
    75%    35.500000   4.132500
    max    51.000000   4.800000
    ---------------end----------------
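As a small illustration of how the reductions treat mixed column types, the sketch below uses a three-row frame (a shortened, illustrative version of the data above). It assumes a pandas version where `numeric_only` must be requested explicitly, which is the case for pandas 2.x.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Tom', 'James', 'Ricky'],
    'Age': [25, 26, 25],
    'Rating': [4.23, 3.24, 3.98],
})

# Restrict the reductions to numeric columns so the string column is skipped.
col_means = df.mean(numeric_only=True)
row_sums = df.sum(axis=1, numeric_only=True)

# include='all' asks describe() to summarize every column, not only numeric ones.
all_summary = df.describe(include='all')

print(col_means)
print(row_sums)
print(all_summary)
```

With `include='all'`, statistics that do not apply to a column (for example the mean of Name) are reported as NaN.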
    

     

    7. Function application

    • Table-wise function application: pipe()
    • Row- or column-wise function application: apply()
    • Element-wise function application: applymap()

    Custom operations can be performed by passing a function and the appropriate number of arguments to pipe(); the DataFrame itself is passed as the first argument.

    import pandas as pd
    import numpy as np
    def adder(ele1,ele2):
        return ele1+ele2
    df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])
    print(df)
    print('---------------end----------------')
    print(df.pipe(adder,2))                      # adds 2 to every element
    print('---------------end----------------')
    print(df.apply(np.mean))                     # mean of each column
    print('---------------end----------------')
    print(df.apply(np.mean,axis=1))              # mean of each row
    print('---------------end----------------')
    print(df.apply(lambda x:x.max()-x.min()))    # range of each column
    print('---------------end----------------')
    print(df['col1'].map(lambda x:x*100))        # element-wise on a Series
    print('---------------end----------------')
    print(df.applymap(lambda x:x*100))           # element-wise on a DataFrame; renamed to DataFrame.map in pandas 2.1
    print('---------------end----------------')

    Output

           col1      col2      col3
    0  1.689749  0.959856  1.074871
    1 -0.392017  0.001075  0.806392
    2 -0.484529  0.635483  0.644830
    3 -0.049649  0.113976 -0.220698
    4  1.413197 -0.576231 -0.075871
    ---------------end----------------
           col1      col2      col3
    0  3.689749  2.959856  3.074871
    1  1.607983  2.001075  2.806392
    2  1.515471  2.635483  2.644830
    3  1.950351  2.113976  1.779302
    4  3.413197  1.423769  1.924129
    ---------------end----------------
    col1    0.435350
    col2    0.226832
    col3    0.445905
    dtype: float64
    ---------------end----------------
    0    1.241492
    1    0.138483
    2    0.265261
    3   -0.052123
    4    0.253698
    dtype: float64
    ---------------end----------------
    col1    2.174278
    col2    1.536088
    col3    1.295569
    dtype: float64
    ---------------end----------------
    0    168.974915
    1    -39.201732
    2    -48.452922
    3     -4.964864
    4    141.319700
    Name: col1, dtype: float64
    ---------------end----------------
             col1       col2        col3
    0  168.974915  95.985614  107.487138
    1  -39.201732   0.107497   80.639193
    2  -48.452922  63.548250   64.483009
    3   -4.964864  11.397646  -22.069797
    4  141.319700 -57.623138   -7.587075
    ---------------end----------------
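The same three levels of function application can be checked on a small deterministic frame, which makes the results easy to verify by hand. This is a sketch with made-up values; `adder` mirrors the helper used above.

```python
import pandas as pd

def adder(frame, amount):
    # Illustrative helper: add a constant to every element of the frame.
    return frame + amount

df = pd.DataFrame({'col1': [1.0, 2.0, 3.0], 'col2': [10.0, 20.0, 40.0]})

shifted = df.pipe(adder, 2)                                # whole-table function
col_range = df.apply(lambda col: col.max() - col.min())    # one call per column
row_mean = df.apply(lambda row: row.mean(), axis=1)        # one call per row
scaled = df['col1'].map(lambda x: x * 100)                 # one call per element

print(shifted)
print(col_range)
print(row_mean)
print(scaled)
```

`pipe` is mainly useful for chaining: `df.pipe(f, a).pipe(g, b)` reads left to right, whereas `g(f(df, a), b)` reads inside out.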
    

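    8. Reindexing

    Reindexing changes the row labels and column labels of a DataFrame. To reindex means to conform the data to match a given set of labels along a particular axis. It can be used to:

    • Reorder the existing data to match a new set of labels
    • Insert missing-value (NA) markers at label positions where no data exists for the label

    import pandas as pd
    import numpy as np
    N=20
    df = pd.DataFrame({
        'A': pd.date_range(start='2016-01-01',periods=N,freq='D'),
        'x': np.linspace(0,stop=N-1,num=N),
        'y': np.random.rand(N),
        'C': np.random.choice(['Low','Medium','High'],N).tolist(),
        'D': np.random.normal(100, 10, size=(N)).tolist()
    })
    # keep rows 0, 2, 5 and columns A, C; column B does not exist, so it is filled with NaN
    df_reindexed = df.reindex(index=[0,2,5], columns=['A', 'C', 'B'])
    print(df_reindexed)

    Output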

               A       C   B
    0 2016-01-01    High NaN
    2 2016-01-03  Medium NaN
    5 2016-01-06  Medium NaN

    Reindexing to align with another object

    import pandas as pd
    import numpy as np
    df1 = pd.DataFrame(np.random.randn(10,3),columns=['col1','col2','col3'])
    df2 = pd.DataFrame(np.random.randn(7,3),columns=['col1','col2','col3'])
    df1 = df1.reindex_like(df2)   # df1 now has the same row and column labels as df2
    print(df1)

    Output

           col1      col2      col3
    0  0.533272  1.462343  1.958989
    1  0.822496  1.020661 -0.958452
    2  0.583271  1.100357  0.405649
    3 -0.617700 -0.444208  0.921092
    4 -0.883714 -0.068178  1.507545
    5 -0.696816  0.729113 -0.509259
    6 -0.127911 -0.255686 -1.378398

    Filling while reindexing: the method argument accepts the following values

    • pad/ffill - fill values forward
    • bfill/backfill - fill values backward
    • nearest - fill from the nearest index value

    import pandas as pd
    import numpy as np
    df1 = pd.DataFrame(np.random.randn(6,3),columns=['col1','col2','col3'])
    df2 = pd.DataFrame(np.random.randn(2,3),columns=['col1','col2','col3'])
    print(df2.reindex_like(df1))
    print("Data Frame with Forward Fill:")
    print(df2.reindex_like(df1,method='ffill'))

    Output

           col1      col2      col3
    0  0.518742  0.162080  1.606103
    1 -0.355712  2.200266  1.072651
    2       NaN       NaN       NaN
    3       NaN       NaN       NaN
    4       NaN       NaN       NaN
    5       NaN       NaN       NaN
    Data Frame with Forward Fill:
           col1      col2      col3
    0  0.518742  0.162080  1.606103
    1 -0.355712  2.200266  1.072651
    2 -0.355712  2.200266  1.072651
    3 -0.355712  2.200266  1.072651
    4 -0.355712  2.200266  1.072651
    5 -0.355712  2.200266  1.072651

    Limits on filling while reindexing: the limit argument gives extra control over filling by capping the number of consecutive labels that are filled.

    import pandas as pd
    import numpy as np
    df1 = pd.DataFrame(np.random.randn(6,3),columns=['col1','col2','col3'])
    df2 = pd.DataFrame(np.random.randn(2,3),columns=['col1','col2','col3'])
    print(df2.reindex_like(df1))
    print("Data Frame with Forward Fill limiting to 1:")
    print(df2.reindex_like(df1,method='ffill',limit=1))

    Output

           col1      col2      col3
    0  0.550406  0.220336 -0.733154
    1  0.372353  0.978386  1.202727
    2       NaN       NaN       NaN
    3       NaN       NaN       NaN
    4       NaN       NaN       NaN
    5       NaN       NaN       NaN
    Data Frame with Forward Fill limiting to 1:
           col1      col2      col3
    0  0.550406  0.220336 -0.733154
    1  0.372353  0.978386  1.202727
    2  0.372353  0.978386  1.202727
    3       NaN       NaN       NaN
    4       NaN       NaN       NaN
    5       NaN       NaN       NaN
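The effect of `method` and `limit` is easier to see on fixed values. The sketch below applies `reindex` directly to a two-element Series (illustrative data) rather than going through `reindex_like`.

```python
import pandas as pd

s = pd.Series([10, 20], index=[0, 1])
new_index = range(5)

plain = s.reindex(new_index)                              # missing labels become NaN
filled = s.reindex(new_index, method='ffill')             # carry the last value forward
limited = s.reindex(new_index, method='ffill', limit=1)   # fill at most one consecutive label

print(plain)
print(filled)
print(limited)
```

Filling with `method` requires the original index to be sorted (monotonic), which is the case here.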
    

    Renaming

    import pandas as pd
    import numpy as np
    df1 = pd.DataFrame(np.random.randn(6,3),columns=['col1','col2','col3'])
    print(df1)
    print("After renaming the rows and columns:")
    # labels not mentioned in the mappings are left unchanged
    print(df1.rename(columns={'col1' : 'c1', 'col2' : 'c2'},index = {0 : 'apple', 1 : 'banana', 2 : 'durian'}))

    Output

           col1      col2      col3
    0  0.162944 -0.257846 -0.890368
    1 -0.969776  1.685473 -1.330109
    2 -1.271563 -0.375700  0.778564
    3 -1.123660  0.849679  0.436355
    4  0.321475  0.779693 -2.100270
    5 -1.184636 -0.206975  0.941504
    After renaming the rows and columns:
                  c1        c2      col3
    apple   0.162944 -0.257846 -0.890368
    banana -0.969776  1.685473 -1.330109
    durian -1.271563 -0.375700  0.778564
    3      -1.123660  0.849679  0.436355
    4       0.321475  0.779693 -2.100270
    5      -1.184636 -0.206975  0.941504
    

    9. Iteration

    Iterating directly over a DataFrame yields its column names.

    import pandas as pd
    import numpy as np
    N=20
    df = pd.DataFrame({
        'A': pd.date_range(start='2016-01-01',periods=N,freq='D'),
        'x': np.linspace(0,stop=N-1,num=N),
        'y': np.random.rand(N),
        'C': np.random.choice(['Low','Medium','High'],N).tolist(),
        'D': np.random.normal(100, 10, size=(N)).tolist()
        })
    for col in df:
        print(col)

    Output

    A
    x
    y
    C
    D

    To iterate over the contents of a DataFrame, the following methods can be used:

    • items() - iterate over (column name, Series) pairs; it was called iteritems() before that name was removed in pandas 2.0
    • iterrows() - iterate over rows as (index, Series) pairs
    • itertuples() - iterate over rows as namedtuples

    import pandas as pd
    import numpy as np
    df = pd.DataFrame(np.random.randn(4,3),columns=['col1','col2','col3'])
    print('------------iteritems--------------')
    for key,value in df.items():
        print(key,value)
    print('----------------end----------------')
    print('-------------iterrows--------------')
    for row_index,row in df.iterrows():
        print(row_index,row)
    print('----------------end----------------')
    print('-------------itertuples------------')
    for row in df.itertuples():
        print(row)
    print('----------------end----------------')

    Output

    ------------iteritems--------------
    col1 0   -0.453626
    1   -1.555137
    2    1.209289
    3    0.238345
    Name: col1, dtype: float64
    col2 0   -0.309713
    1   -0.018258
    2    0.326646
    3    1.584639
    Name: col2, dtype: float64
    col3 0   -1.746411
    1    0.144020
    2    0.932400
    3   -0.848700
    Name: col3, dtype: float64
    ----------------end----------------
    -------------iterrows--------------
    0 col1   -0.453626
    col2   -0.309713
    col3   -1.746411
    Name: 0, dtype: float64
    1 col1   -1.555137
    col2   -0.018258
    col3    0.144020
    Name: 1, dtype: float64
    2 col1    1.209289
    col2    0.326646
    col3    0.932400
    Name: 2, dtype: float64
    3 col1    0.238345
    col2    1.584639
    col3   -0.848700
    Name: 3, dtype: float64
    ----------------end----------------
    -------------itertuples------------
    Pandas(Index=0, col1=-0.453625680715928, col2=-0.30971276978094636, col3=-1.7464111236386397)
    Pandas(Index=1, col1=-1.5551365938912898, col2=-0.018257622785818713, col3=0.1440202346073698)
    Pandas(Index=2, col1=1.2092886777094904, col2=0.3266461576970751, col3=0.9323998460902878)
    Pandas(Index=3, col1=0.23834535595475798, col2=1.5846386089382405, col3=-0.8486996087036667)
    ----------------end----------------
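A compact sketch of the three iteration styles on a tiny frame with fixed values (illustrative data), collecting the results into lists so they can be inspected.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

# items() yields (column name, Series) pairs.
column_names = [name for name, column in df.items()]

# iterrows() yields (index label, Series) pairs, one per row.
row_sums = [int(row.sum()) for label, row in df.iterrows()]

# itertuples() yields namedtuples; fields are reachable by attribute.
first_cells = [row.col1 for row in df.itertuples()]

print(column_names, row_sums, first_cells)
```

For numeric work, vectorized operations such as `df.sum(axis=1)` are generally preferable to explicit loops; iteration is mostly useful when each row has to be handed to other code.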
    

    10. Sorting

    sort_values() lets you choose the sorting algorithm (mergesort, heapsort, or quicksort) through its kind argument.

    import pandas as pd
    import numpy as np
    unsorted_df=pd.DataFrame(np.random.randn(10,2),index=[1,4,6,2,3,5,9,8,0,7],columns=['col2','col1'])
    print(unsorted_df)
    print('---------sort by label----------')
    sorted_df=unsorted_df.sort_index()
    print(sorted_df)
    print('------reverse the sort order------')
    sorted_df = unsorted_df.sort_index(ascending=False)
    print(sorted_df)
    print('----------sort by column-----------')
    sorted_df=unsorted_df.sort_index(axis=1)
    print(sorted_df)
    print('----------sort by value-----------')
    sorted_df = unsorted_df.sort_values(by='col1')
    print(sorted_df)

    Output

           col2      col1
    1  0.295840 -0.880007
    4  0.151129  1.843255
    6 -0.516764  0.195839
    2 -0.040592  0.582046
    3  1.806547 -0.760579
    5 -1.366668  0.652985
    9 -1.180956  1.198587
    8 -1.621409 -0.555094
    0  0.403722  0.296659
    7  0.520232 -0.759177
    ---------sort by label----------
           col2      col1
    0  0.403722  0.296659
    1  0.295840 -0.880007
    2 -0.040592  0.582046
    3  1.806547 -0.760579
    4  0.151129  1.843255
    5 -1.366668  0.652985
    6 -0.516764  0.195839
    7  0.520232 -0.759177
    8 -1.621409 -0.555094
    9 -1.180956  1.198587
    ------reverse the sort order------
           col2      col1
    9 -1.180956  1.198587
    8 -1.621409 -0.555094
    7  0.520232 -0.759177
    6 -0.516764  0.195839
    5 -1.366668  0.652985
    4  0.151129  1.843255
    3  1.806547 -0.760579
    2 -0.040592  0.582046
    1  0.295840 -0.880007
    0  0.403722  0.296659
    ----------sort by column-----------
           col1      col2
    1 -0.880007  0.295840
    4  1.843255  0.151129
    6  0.195839 -0.516764
    2  0.582046 -0.040592
    3 -0.760579  1.806547
    5  0.652985 -1.366668
    9  1.198587 -1.180956
    8 -0.555094 -1.621409
    0  0.296659  0.403722
    7 -0.759177  0.520232
    ----------sort by value-----------
           col2      col1
    1  0.295840 -0.880007
    3  1.806547 -0.760579
    7  0.520232 -0.759177
    8 -1.621409 -0.555094
    6 -0.516764  0.195839
    0  0.403722  0.296659
    2 -0.040592  0.582046
    5 -1.366668  0.652985
    9 -1.180956  1.198587
    4  0.151129  1.843255
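The choice of algorithm matters mainly when there are ties. The sketch below, on illustrative data with repeated values, uses `kind='mergesort'` (a stable sort) and contrasts it with sorting on two columns.

```python
import pandas as pd

df = pd.DataFrame({'col1': [2, 1, 1, 1], 'col2': [1, 3, 2, 4]})

# mergesort is stable: rows with equal col1 keep their original relative order.
stable = df.sort_values(by='col1', kind='mergesort')

# Sorting by several columns breaks ties in col1 using col2.
by_both = df.sort_values(by=['col1', 'col2'])

print(stable)
print(by_both)
```

In `stable`, the three rows with `col1 == 1` appear in their original order (index 1, 2, 3); in `by_both` they are reordered by `col2`.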
    

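    11. Strings and text data

    These functions are available through the .str accessor of a Series or Index.

    Function       Description
    lower()        Converts the strings in the Series/Index to lowercase
    upper()        Converts the strings in the Series/Index to uppercase
    len()          Computes the length of each string
    strip()        Strips whitespace from both sides of each string in the Series/Index
    split()        Splits each string on the given pattern
    cat()          Concatenates the Series/Index elements with the given separator
    get_dummies()  Returns a DataFrame of one-hot encoded values
    contains()     Returns a boolean for each element: True if the element contains the substring
    replace(a,b)   Replaces the value a with the value b
    repeat()       Repeats each element the specified number of times
    count()        Returns the number of occurrences of the pattern in each element
    startswith()   Returns True if the element starts with the pattern
    endswith()     Returns True if the element ends with the pattern
    find()         Returns the position of the first occurrence of the pattern
    findall()      Returns a list of all occurrences of the pattern
    swapcase()     Swaps upper and lower case
    islower()      Checks whether all characters are lowercase
    isupper()      Checks whether all characters are uppercase
    isnumeric()    Checks whether all characters are numeric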
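A few of the functions in the table, applied through the `.str` accessor to a small illustrative Series. The sample strings are made up for this sketch.

```python
import pandas as pd

s = pd.Series([' Tom ', 'William Rick', 'john'])

lowered = s.str.lower()                       # lowercase every string
stripped = s.str.strip()                      # remove leading/trailing whitespace
lengths = s.str.len()                         # length of each original string
has_inner_space = stripped.str.contains(' ')  # boolean per element
joined = stripped.str.cat(sep='_')            # one string, elements joined by '_'

print(lowered.tolist())
print(stripped.tolist())
print(lengths.tolist())
print(has_inner_space.tolist())
print(joined)
```

Each `.str` method works element by element and returns a new Series (or, for `cat` without another Series, a single string), leaving the original unchanged.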

     

     

     

     

     

     

     

     

     

     

     

     

     

     

     

     

     

     

     

     

     

     

     

     

     

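    12. Customizing display options

    • pd.get_option(param)  # show the current value of an option
    • pd.set_option(param, value)  # set the value of an option
    • pd.reset_option(param)  # reset an option to its default value
    • pd.describe_option(param)  # print the description of an option
    • pd.option_context(param, value)  # set an option temporarily; the previous value is restored when the with-block exits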
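A short sketch of how these calls interact, using the `display.max_rows` and `display.precision` options. The specific values 10 and 3 are arbitrary choices for illustration.

```python
import pandas as pd

default_rows = pd.get_option('display.max_rows')
default_precision = pd.get_option('display.precision')

# option_context applies the setting only inside the with-block.
with pd.option_context('display.max_rows', 10):
    inside = pd.get_option('display.max_rows')
after = pd.get_option('display.max_rows')

# set_option changes the value until reset_option restores the default.
pd.set_option('display.precision', 3)
changed = pd.get_option('display.precision')
pd.reset_option('display.precision')
restored = pd.get_option('display.precision')

print(default_rows, inside, after, changed, restored)
```

`option_context` is convenient for one-off printing of a large frame, since there is no need to remember to undo the change afterwards.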
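
    Parameter                    Description
    "display.max_rows"           Maximum number of rows displayed
    "display.max_columns"        Maximum number of columns displayed
    "display.expand_frame_repr"  Whether a wide DataFrame is printed across multiple lines
    "display.max_colwidth"       Maximum column width displayed
    "display.precision"          Number of decimal places displayed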

     

     

     

     

     

     

     

     

     

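    13. Indexing

    • .loc[rows, columns]   # label-based; the first argument selects rows, the second selects columns; each accepts a single label, a list of labels, or a label slice
    • .iloc[rows, columns]  # integer-position-based; the first argument selects rows, the second selects columns; each accepts an integer, a list of integers, or an integer slice
    • .ix[rows, columns]    # mixed label/position access; deprecated and removed in pandas 1.0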
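A minimal sketch of the difference between `.loc` and `.iloc` on an illustrative frame with string row labels. Note that the indexers use square brackets, not parentheses.

```python
import pandas as pd

df = pd.DataFrame(
    {'A': [1, 2, 3, 4], 'B': [10, 20, 30, 40]},
    index=['a', 'b', 'c', 'd'],
)

by_label = df.loc['a':'c', 'B']     # label slice: the end label 'c' is included
by_position = df.iloc[0:2, 1]       # integer slice: the end position 2 is excluded
single = df.loc['d', 'A']           # scalar lookup by row label and column label
picked = df.iloc[[0, 3], [0]]       # lists of integer positions return a DataFrame

print(by_label.tolist(), by_position.tolist(), single, picked.shape)
```

The inclusive end of a label slice versus the exclusive end of a positional slice is a frequent source of off-by-one surprises when switching between the two indexers.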

     

     

     

     

     

     

     

     

     

     

  • Original article: https://www.cnblogs.com/xbyfight/p/11172071.html