  • Basic usage of pandas

    This post records some basic pandas methods. Becoming fluent with them makes everyday data handling much smoother.

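    I. Creating objects

      pandas has two main data structures: Series and DataFrame. Basic create, read, update, and delete operations on these two structures were covered in an earlier post; if anything is unclear, see: https://www.cnblogs.com/ppzhang/p/13747910.html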
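
As a quick reminder of the two structures, here is a minimal sketch. The values and the names `s` and `df` are made up for illustration and do not come from the earlier post.

```python
import numpy as np
import pandas as pd

# A Series is a one-dimensional labeled array; np.nan marks a missing value.
s = pd.Series([1, 3, 5, np.nan, 6, 8])

# A DataFrame is a two-dimensional table; each dict key becomes a column.
df = pd.DataFrame({"A": [1, 2, 3], "B": ["x", "y", "z"]})

print(s)
print(df)
```

Both objects carry an index (the row labels), which most of the methods below rely on.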

    II. Viewing data

    This section mainly covers viewing the data in a two-dimensional DataFrame.

      1. head(): view the first n rows, counting from the top

    import pandas as pd
    from pandas import Series,DataFrame
    import numpy as np
    
    def create_numpay_dataform():
        df2 = pd.DataFrame({'A': 1,
                            'B': pd.Timestamp('20130102'),
                            'C': pd.Series(3, index=list(range(4)), dtype='float32'),
                            'D': np.array([3] * 4, dtype='int32'),
                            'E': pd.Categorical(["test", "train", "test", "train"]),
                            'F': 'foo'})
    
        return df2
    
    def print_data():
    
        df = create_numpay_dataform()
    
        print("-------Original data-----")
        print(df)
        print("-------First 1 row-----")
        print(df.head(1))
        print("-------First 3 rows-----")
        print(df.head(3))
    
    if __name__ == '__main__':
        print_data()
    
    
    # Output
    -------Original data-----
       A          B    C  D      E    F
    0  1 2013-01-02  3.0  3   test  foo
    1  1 2013-01-02  3.0  3  train  foo
    2  1 2013-01-02  3.0  3   test  foo
    3  1 2013-01-02  3.0  3  train  foo
    -------First 1 row-----
       A          B    C  D     E    F
    0  1 2013-01-02  3.0  3  test  foo
    -------First 3 rows-----
       A          B    C  D      E    F
    0  1 2013-01-02  3.0  3   test  foo
    1  1 2013-01-02  3.0  3  train  foo
    2  1 2013-01-02  3.0  3   test  foo

      2. tail(): view the last n rows, counting from the bottom

    import pandas as pd
    from pandas import Series,DataFrame
    import numpy as np
    
    def create_numpay_dataform():
        df2 = pd.DataFrame({'A': 1,
                            'B': pd.Timestamp('20130102'),
                            'C': pd.Series(3, index=list(range(4)), dtype='float32'),
                            'D': np.array([3] * 4, dtype='int32'),
                            'E': pd.Categorical(["test", "train", "test", "train"]),
                            'F': 'foo'})
    
        return df2
    
    def print_data():
    
        df = create_numpay_dataform()
    
        print("-------Original data-----")
        print(df)
        print("-------Last 1 row-----")
        print(df.tail(1))
        print("-------Last 2 rows-----")
        print(df.tail(2))
    
    if __name__ == '__main__':
        print_data()
    
    
    # Output
    -------Original data-----
       A          B    C  D      E    F
    0  1 2013-01-02  3.0  3   test  foo
    1  1 2013-01-02  3.0  3  train  foo
    2  1 2013-01-02  3.0  3   test  foo
    3  1 2013-01-02  3.0  3  train  foo
    -------Last 1 row-----
       A          B    C  D      E    F
    3  1 2013-01-02  3.0  3  train  foo
    -------Last 2 rows-----
       A          B    C  D      E    F
    2  1 2013-01-02  3.0  3   test  foo
    3  1 2013-01-02  3.0  3  train  foo

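      3. to_numpy(): return the underlying data as a NumPy array.

      Notes:

        a. A NumPy array has a single dtype for the whole array.

        b. In a DataFrame, each column can have its own dtype.

        c. When the columns of a DataFrame have several different dtypes, this operation is relatively expensive.

        d. When to_numpy() is called, pandas looks for a NumPy dtype that can hold all of the dtypes in the DataFrame.

        e. That dtype may end up being object, which means every value in the DataFrame is cast to a Python object.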

    import pandas as pd
    from pandas import Series,DataFrame
    import numpy as np
    
    def create_numpay_dataform():
        dates = pd.date_range('20130101', periods=6)
        df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
    
        df2 = pd.DataFrame({'A': 1,
                            'B': pd.Timestamp('20130102'),
                            'C': pd.Series(3, index=list(range(4)), dtype='float32'),
                            'D': np.array([3] * 4, dtype='int32'),
                            'E': pd.Categorical(["test", "train", "test", "train"]),
                            'F': 'foo'})
    
    
    
        return df,df2
    
    def print_data():
    
        df1,df2 = create_numpay_dataform()
    
        print("-------DF1-----")
        print(df1)
        print("\n" + "-------DF2-----")
        print(df2)
    
        print("\n" + "-------DF1 to_numpy-----")
        print("All values in df1 are floats, so DataFrame.to_numpy() is fast and does not copy data.")
        print(df1.to_numpy())
    
    
        print("\n" + "-------DF2 to_numpy-----")
        print("df2 contains several dtypes, so DataFrame.to_numpy() is relatively expensive.")
        print(df2.to_numpy())
    
    if __name__ == '__main__':
        print_data()
    
    
    # Output
    -------DF1-----
                       A         B         C         D
    2013-01-01  0.214933 -0.932719  0.409751 -1.579671
    2013-01-02  0.857846 -0.450446  1.334220 -0.256340
    2013-01-03  1.855527 -0.459457 -0.088609  1.970731
    2013-01-04 -0.315940  1.216017  0.145649  0.844216
    2013-01-05  1.229986 -0.307384 -0.816692 -1.266780
    2013-01-06 -0.324177 -0.606538 -0.993541 -1.018344
    
    -------DF2-----
       A          B    C  D      E    F
    0  1 2013-01-02  3.0  3   test  foo
    1  1 2013-01-02  3.0  3  train  foo
    2  1 2013-01-02  3.0  3   test  foo
    3  1 2013-01-02  3.0  3  train  foo
    
    -------DF1 to_numpy-----
    All values in df1 are floats, so DataFrame.to_numpy() is fast and does not copy data.
    [[ 0.21493314 -0.93271907  0.40975128 -1.57967127]
     [ 0.85784569 -0.45044625  1.3342199  -0.25634002]
     [ 1.85552743 -0.45945651 -0.08860859  1.97073069]
     [-0.31593997  1.2160171   0.14564932  0.8442159 ]
     [ 1.22998622 -0.30738437 -0.81669186 -1.26677969]
     [-0.3241766  -0.60653794 -0.99354086 -1.01834351]]
    
    -------DF2 to_numpy-----
    df2 contains several dtypes, so DataFrame.to_numpy() is relatively expensive.
    [[1 Timestamp('2013-01-02 00:00:00') 3.0 3 'test' 'foo']
     [1 Timestamp('2013-01-02 00:00:00') 3.0 3 'train' 'foo']
     [1 Timestamp('2013-01-02 00:00:00') 3.0 3 'test' 'foo']
     [1 Timestamp('2013-01-02 00:00:00') 3.0 3 'train' 'foo']]
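
The dtype point can be checked directly by looking at the dtype of the returned array. The sketch below uses two small made-up frames (`floats` and `mixed` are illustrative names, not the frames from the example above).

```python
import numpy as np
import pandas as pd

# All-float frame: the result keeps a numeric dtype.
floats = pd.DataFrame(np.zeros((3, 2)), columns=["A", "B"])

# Integers next to strings: no numeric dtype can hold both columns.
mixed = pd.DataFrame({"A": [1, 2, 3], "B": ["x", "y", "z"]})

floats_dtype = floats.to_numpy().dtype
mixed_dtype = mixed.to_numpy().dtype

print(floats_dtype)
print(mixed_dtype)
```

In the mixed case the array falls back to object, which is why the conversion costs more.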

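      4. describe(): show a quick statistical summary of the data. Three parameters are covered here:

        a. percentiles: sets which percentiles are reported for numeric columns. The default is [.25, .5, .75], that is, the values at the 25%, 50%, and 75% points of the data. This can be changed, for example describe(percentiles=[.2, .75, .8]); in the output shown here the 50% row is still included by default.

        b. include: by default only numeric columns are summarized. With include=['O'], statistics for discrete (object-dtype) columns are computed instead; with include='all', both numeric and discrete columns are shown.

        c. exclude: the counterpart of include. Where include specifies which dtypes to select, exclude specifies which ones to leave out. It defaults to None, which drops no columns and therefore has no effect.

        d. To show only one row of the result, index into the summary by position, for example describe().iloc[4]: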

    import pandas as pd
    from pandas import Series,DataFrame
    import numpy as np
    
    def create_numpay_dataform():
        
        df = pd.DataFrame(data={
            'A': list('abaacdadaf'),
            'B': [2, 4, 6, 3, 6, 2, 5, 8, 0, 2]
        })
    
    
        return df
    
    def print_data():
    
        df1 = create_numpay_dataform()
    
        print("-------DF1-----")
        print(df1)
        print("\n" + "-------DF1.describe() with no arguments-----")
        # percentiles defaults to 25%, 50%, 75%
        # include defaults to numeric columns only
        # exclude defaults to None, so no columns are dropped
        print(df1.describe())
    
        print("\n" + "-------DF1.describe() with percentiles-----")
        # Return the values at the 10%, 60%, 80% and 90% points; the 50% row is shown by default
        print(df1.describe(percentiles=[.1,.6,.8,.9]))
    
    
        print("\n" + "-------DF1.describe() with include='all' -----")
        # 'all' shows the statistics of both numeric and discrete columns
        print(df1.describe(include="all"))
        print("\n" + "-------DF1.describe() with include='O' -----")
        # include='O' computes the statistics of discrete (object) columns
        print(df1.describe(include='O'))
    
    
        print("\n" + "-------DF1.describe() with exclude -----")
        # exclude='O' leaves out the discrete (object) columns
        print(df1.describe(exclude='O'))
    
        print("\n" + "-------DF1.describe() showing row N of the result -----")
        print(df1.describe().iloc[4])
    
    if __name__ == '__main__':
        print_data()
    
    
    # Output
    
    -------DF1-----
       A  B
    0  a  2
    1  b  4
    2  a  6
    3  a  3
    4  c  6
    5  d  2
    6  a  5
    7  d  8
    8  a  0
    9  f  2
    
    -------DF1.describe() with no arguments-----
                   B
    count  10.000000
    mean    3.800000
    std     2.440401
    min     0.000000
    25%     2.000000
    50%     3.500000
    75%     5.750000
    max     8.000000
    
    -------DF1.describe() with percentiles-----
                   B
    count  10.000000
    mean    3.800000
    std     2.440401
    min     0.000000
    10%     1.800000
    50%     3.500000
    60%     4.400000
    80%     6.000000
    90%     6.200000
    max     8.000000
    
    -------DF1.describe() with include='all' -----
              A          B
    count    10  10.000000
    unique    5        NaN
    top       a        NaN
    freq      5        NaN
    mean    NaN   3.800000
    std     NaN   2.440401
    min     NaN   0.000000
    25%     NaN   2.000000
    50%     NaN   3.500000
    75%     NaN   5.750000
    max     NaN   8.000000
    
    -------DF1.describe() with include='O' -----
             A
    count   10
    unique   5
    top      a
    freq     5
    
    -------DF1.describe() with exclude -----
                   B
    count  10.000000
    mean    3.800000
    std     2.440401
    min     0.000000
    25%     2.000000
    50%     3.500000
    75%     5.750000
    max     8.000000
    
    -------DF1.describe() showing row N of the result -----
    B    2.0
    Name: 25%, dtype: float64
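
Because the summary returned by describe() is itself a DataFrame, a row can also be picked by its label rather than by position, which does not depend on how many percentile rows were requested. A minimal sketch, reusing the B column from the example above (`summary` is an illustrative name):

```python
import pandas as pd

df = pd.DataFrame({"B": [2, 4, 6, 3, 6, 2, 5, 8, 0, 2]})
summary = df.describe()

# Select statistics by row label and column name.
first_quartile = summary.loc["25%", "B"]
median = summary.loc["50%", "B"]

print(first_quartile, median)
```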

      5. sort_index(): sort by axis labels

    import pandas as pd
    from pandas import Series,DataFrame
    import numpy as np
    
    def create_numpay_dataform():
    
        df = pd.DataFrame(data={
            'A': list('abaacdadaf'),
            'B': [2, 4, 6, 3, 6, 2, 5, 8, 0, 2],
            'F':  [2, 6, 6, 7, 9, 2, 0, 8, 0, 2],
            'D': list('afvarfxaes'),
    
        })
    
    
        return df
    
    def print_data():
    
        df1 = create_numpay_dataform()
    
        print("-------DF1-----")
        print(df1)
        print("\n" + "-------DF1 sorted by index-----")
        # sort_index() sorts ascending by default; ascending=False reverses the order
        print(df1.sort_index( ascending=False))
    
        print("\n" + "-------DF1 sorted by column labels (header)-----")
        # axis=1 sorts the column labels instead of the row labels
        print(df1.sort_index(axis=1 ,ascending=False))
    
    
    
    if __name__ == '__main__':
        print_data()
    
    
    # Output
    
    -------DF1-----
       A  B  F  D
    0  a  2  2  a
    1  b  4  6  f
    2  a  6  6  v
    3  a  3  7  a
    4  c  6  9  r
    5  d  2  2  f
    6  a  5  0  x
    7  d  8  8  a
    8  a  0  0  e
    9  f  2  2  s
    
    -------DF1 sorted by index-----
       A  B  F  D
    9  f  2  2  s
    8  a  0  0  e
    7  d  8  8  a
    6  a  5  0  x
    5  d  2  2  f
    4  c  6  9  r
    3  a  3  7  a
    2  a  6  6  v
    1  b  4  6  f
    0  a  2  2  a
    
    -------DF1 sorted by column labels (header)-----
       F  D  B  A
    0  2  a  2  a
    1  6  f  4  b
    2  6  v  6  a
    3  7  a  3  a
    4  9  r  6  c
    5  2  f  2  d
    6  0  x  5  a
    7  8  a  8  d
    8  0  e  0  a
    9  2  s  2  f

      6. sort_values(): sort by values

    import pandas as pd
    from pandas import Series,DataFrame
    import numpy as np
    
    def create_numpay_dataform():
    
        df = pd.DataFrame(data={
            'A': list('abaacdadaf'),
            'B': [2, 4, 6, 3, 6, 2, 5, 8, 0, 2],
            'F':  [2, 6, 6, 7, 9, 2, 0, 8, 0, 2],
            'D': list('afvarfxaes'),
    
        })
    
    
        return df
    
    def print_data():
    
        df1 = create_numpay_dataform()
    
        print("-------DF1-----")
        print(df1)
        print("\n" + "-------DF1 sorted by values-----")
        # sort_values() sorts ascending by default; ascending=False reverses the order
        print(df1.sort_values(by='B' ,ascending=False))
    
    
    
    
    if __name__ == '__main__':
        print_data()
    
    # Output
    
    -------DF1-----
       A  B  F  D
    0  a  2  2  a
    1  b  4  6  f
    2  a  6  6  v
    3  a  3  7  a
    4  c  6  9  r
    5  d  2  2  f
    6  a  5  0  x
    7  d  8  8  a
    8  a  0  0  e
    9  f  2  2  s
    
    -------DF1 sorted by values-----
       A  B  F  D
    7  d  8  8  a
    2  a  6  6  v
    4  c  6  9  r
    6  a  5  0  x
    1  b  4  6  f
    3  a  3  7  a
    0  a  2  2  a
    5  d  2  2  f
    9  f  2  2  s
    8  a  0  0  e
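
Sorting by one column leaves ties in whatever order the sort produces. As a hedged sketch, `by` and `ascending` also accept lists, so a second column can break the ties. The small frame below is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"B": [2, 6, 6, 2], "F": [9, 7, 1, 3]})

# Sort by B descending, then by F ascending within equal B values.
result = df.sort_values(by=["B", "F"], ascending=[False, True])

print(result)
```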

      7. index: show the index (the row labels, displayed as the leftmost column)

    import pandas as pd
    from pandas import Series,DataFrame
    import numpy as np
    
    def create_numpay_dataform():
        df2 = pd.DataFrame({'A': 1,
                            'B': pd.Timestamp('20130102'),
                            'C': pd.Series(3, index=list(range(4)), dtype='float32'),
                            'D': np.array([3] * 4, dtype='int32'),
                            'E': pd.Categorical(["test", "train", "test", "train"]),
                            'F': 'foo'})
    
        return df2
    
    def print_data():
    
        df = create_numpay_dataform()
    
        print("-------Original data-----")
        print(df)
        print("-------Index-----")
        print(df.index)
    
    
    if __name__ == '__main__':
        print_data()
    
    
    # Output
    -------Original data-----
       A          B    C  D      E    F
    0  1 2013-01-02  3.0  3   test  foo
    1  1 2013-01-02  3.0  3  train  foo
    2  1 2013-01-02  3.0  3   test  foo
    3  1 2013-01-02  3.0  3  train  foo
    -------Index-----
    Int64Index([0, 1, 2, 3], dtype='int64')

      8. columns: show the column names (displayed as the top row)

    import pandas as pd
    from pandas import Series,DataFrame
    import numpy as np
    
    def create_numpay_dataform():
        df2 = pd.DataFrame({'A': 1,
                            'B': pd.Timestamp('20130102'),
                            'C': pd.Series(3, index=list(range(4)), dtype='float32'),
                            'D': np.array([3] * 4, dtype='int32'),
                            'E': pd.Categorical(["test", "train", "test", "train"]),
                            'F': 'foo'})
    
        return df2
    
    def print_data():
    
        df = create_numpay_dataform()
    
        print("-------Original data-----")
        print(df)
        print("-------Column names-----")
        print(df.columns)
    
    
    if __name__ == '__main__':
        print_data()
    
    
    # Output
    
    -------Original data-----
       A          B    C  D      E    F
    0  1 2013-01-02  3.0  3   test  foo
    1  1 2013-01-02  3.0  3  train  foo
    2  1 2013-01-02  3.0  3   test  foo
    3  1 2013-01-02  3.0  3  train  foo
    -------Column names-----
    Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')

      9. T: transpose the data

    import pandas as pd
    from pandas import Series,DataFrame
    import numpy as np
    
    def create_numpay_dataform():
        df = pd.DataFrame(data={
            'A': list('abaacdadaf'),
            'B': [2, 4, 6, 3, 6, 2, 5, 8, 0, 2],
            'F':  [2, 6, 6, 7, 9, 2, 0, 8, 0, 2],
            'D': list('afvarfxaes'),
    
        })
    
    
        return df
    
    def print_data():
    
        df1 = create_numpay_dataform()
    
        print("-------DF1-----")
        print(df1)
        print("\n" + "-------DF1 transposed (rows and columns swapped)-----")
        print(df1.T)
    
    
    
    
    if __name__ == '__main__':
        print_data()
    
    # Output
    
    -------DF1-----
       A  B  F  D
    0  a  2  2  a
    1  b  4  6  f
    2  a  6  6  v
    3  a  3  7  a
    4  c  6  9  r
    5  d  2  2  f
    6  a  5  0  x
    7  d  8  8  a
    8  a  0  0  e
    9  f  2  2  s
    
    -------DF1 transposed (rows and columns swapped)-----
       0  1  2  3  4  5  6  7  8  9
    A  a  b  a  a  c  d  a  d  a  f
    B  2  4  6  3  6  2  5  8  0  2
    F  2  6  6  7  9  2  0  8  0  2
    D  a  f  v  a  r  f  x  a  e  s

    III. Selecting data

      1. Select a single column with df["column_name"] or df.column_name

    import pandas as pd
    from pandas import Series,DataFrame
    import numpy as np
    
    def create_numpay_dataform():
        
        df = pd.DataFrame(data={
            'A': list('abaacdadaf'),
            'B': [2, 4, 6, 3, 6, 2, 5, 8, 0, 2],
            'F':  [2, 6, 6, 7, 9, 2, 0, 8, 0, 2],
            'D': list('afvarfxaes'),
    
        })
    
    
        return df
    
    def print_data():
    
        df1 = create_numpay_dataform()
    
        print("-------DF1-----")
        print(df1)
        print("\n" + "-------Select a single column with ['column_name']-----")
        print(df1["D"])
        print("\n" + "-------Select a single column with DF1.column_name-----")
        print(df1.A)
    
    
    
    
    if __name__ == '__main__':
        print_data()
    
    
    # Output
    
    -------DF1-----
       A  B  F  D
    0  a  2  2  a
    1  b  4  6  f
    2  a  6  6  v
    3  a  3  7  a
    4  c  6  9  r
    5  d  2  2  f
    6  a  5  0  x
    7  d  8  8  a
    8  a  0  0  e
    9  f  2  2  s
    
    -------Select a single column with ['column_name']-----
    0    a
    1    f
    2    v
    3    a
    4    r
    5    f
    6    x
    7    a
    8    e
    9    s
    Name: D, dtype: object
    
    -------Select a single column with DF1.column_name-----
    0    a
    1    b
    2    a
    3    a
    4    c
    5    d
    6    a
    7    d
    8    a
    9    f
    Name: A, dtype: object
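
The two spellings are not always interchangeable. Attribute access only works when the column name is a valid Python identifier and does not collide with an existing DataFrame attribute. A small sketch with made-up column names chosen to trigger both cases:

```python
import pandas as pd

df = pd.DataFrame({"count": [1, 2, 3], "unit price": [9.5, 3.0, 4.25]})

# df.count resolves to the DataFrame.count method, not to the "count" column.
attr_is_method = callable(df.count)

# Bracket access always returns the column, whatever its name.
count_col = df["count"]
price_col = df["unit price"]

print(attr_is_method)
print(count_col.tolist(), price_col.tolist())
```

For that reason the bracket form is usually the safer default in scripts.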

      2. Slice rows with [ ]

    import pandas as pd
    from pandas import Series,DataFrame
    import numpy as np
    
    def create_numpay_dataform():
        
        df = pd.DataFrame(data={
            'A': list('abaacdadaf'),
            'B': [2, 4, 6, 3, 6, 2, 5, 8, 0, 2],
            'F':  [2, 6, 6, 7, 9, 2, 0, 8, 0, 2],
            'D': list('afvarfxaes'),
    
        })
    
    
        return df
    
    def print_data():
    
        df1 = create_numpay_dataform()
    
        print("-------DF1-----")
        print(df1)
        print("\n" + "-------Slice rows-----")
        print(df1[4:6])
        
    
    
    
    
    if __name__ == '__main__':
        print_data()
    
    
    # Output
    
    -------DF1-----
       A  B  F  D
    0  a  2  2  a
    1  b  4  6  f
    2  a  6  6  v
    3  a  3  7  a
    4  c  6  9  r
    5  d  2  2  f
    6  a  5  0  x
    7  d  8  8  a
    8  a  0  0  e
    9  f  2  2  s
    
    -------Slice rows-----
       A  B  F  D
    4  c  6  9  r
    5  d  2  2  f

      3. loc: select by label

    import pandas as pd
    from pandas import Series,DataFrame
    import numpy as np
    
    def create_numpay_dataform():
        
    
        df = pd.DataFrame(data={
            'A': list('abaacdadaf'),
            'B': [2, 4, 6, 3, 6, 2, 5, 8, 0, 2],
            'F':  [2, 6, 6, 7, 9, 2, 0, 8, 0, 2],
            'D': list('afvarfxaes'),
    
        })
    
    
        return df
    
    def print_data():
    
        df1 = create_numpay_dataform()
    
        print("-------DF1-----")
        print(df1)
        print("\n" + "-------Select one row by label------")
        print(df1.loc[8])
        print("\n" + "-------Select several rows by label------")
        print(df1.loc[[1,6,8]])
        print("\n" + "-------Select several columns by label------")
        print(df1.loc[:,['A', 'D']])
        print("\n" + "-------Select given rows and given columns by label------")
        print(df1.loc[4:7, ['A', 'D']])
        print("\n" + "-------Reduce dimensions (one row label returns a Series)------")
        print(df1.loc[5, ['A', 'D']])
    
       
    
    if __name__ == '__main__':
        print_data()
    
    # Output
    
    -------DF1-----
       A  B  F  D
    0  a  2  2  a
    1  b  4  6  f
    2  a  6  6  v
    3  a  3  7  a
    4  c  6  9  r
    5  d  2  2  f
    6  a  5  0  x
    7  d  8  8  a
    8  a  0  0  e
    9  f  2  2  s
    
    -------Select one row by label------
    A    a
    B    0
    F    0
    D    e
    Name: 8, dtype: object
    
    -------Select several rows by label------
       A  B  F  D
    1  b  4  6  f
    6  a  5  0  x
    8  a  0  0  e
    
    -------Select several columns by label------
       A  D
    0  a  a
    1  b  f
    2  a  v
    3  a  a
    4  c  r
    5  d  f
    6  a  x
    7  d  a
    8  a  e
    9  f  s
    
    -------Select given rows and given columns by label------
       A  D
    4  c  r
    5  d  f
    6  a  x
    7  d  a
    
    -------Reduce dimensions (one row label returns a Series)------
    A    d
    D    f
    Name: 5, dtype: object

      4. iloc: select by position

    import pandas as pd
    from pandas import Series,DataFrame
    import numpy as np
    
    def create_numpay_dataform():
    
    
        df = pd.DataFrame(data={
            'A': list('abaacdadaf'),
            'B': [2, 4, 6, 3, 6, 2, 5, 8, 0, 2],
            'F':  [2, 6, 6, 7, 9, 2, 0, 8, 0, 2],
            'D': list('afvarfxaes'),
    
        })
    
    
        return df
    
    def print_data():
    
        df1 = create_numpay_dataform()
    
        print("-------DF1-----")
        print(df1)
       
    
        print("\n" + "-------Select a row by position------")
        print(df1.iloc[3])
        print("\n" + "-------Slice rows and columns by position------")
        print(df1.iloc[3:,:3])
        print("\n" + "-------Select given rows and columns by position------")
        print(df1.iloc[[1,3,5], [0,2]])
       
    
    if __name__ == '__main__':
        print_data()
    
    # Output
    
    -------DF1-----
       A  B  F  D
    0  a  2  2  a
    1  b  4  6  f
    2  a  6  6  v
    3  a  3  7  a
    4  c  6  9  r
    5  d  2  2  f
    6  a  5  0  x
    7  d  8  8  a
    8  a  0  0  e
    9  f  2  2  s
    
    -------Select a row by position------
    A    a
    B    3
    F    7
    D    a
    Name: 3, dtype: object
    
    -------Slice rows and columns by position------
       A  B  F
    3  a  3  7
    4  c  6  9
    5  d  2  2
    6  a  5  0
    7  d  8  8
    8  a  0  0
    9  f  2  2
    
    -------Select given rows and columns by position------
       A  F
    1  b  6
    3  a  7
    5  d  2
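
With a default 0..n-1 index, loc and iloc look almost identical, which hides one difference: a loc slice includes its end label, while an iloc slice excludes its end position. The sketch below uses a made-up frame whose labels start at 10 so that labels and positions cannot be confused.

```python
import pandas as pd

df = pd.DataFrame({"A": list("abcdef")}, index=[10, 11, 12, 13, 14, 15])

# Label-based slice: rows labeled 11, 12 and 13 (end label included).
by_label = df.loc[11:13]

# Position-based slice: positions 1 and 2 only (end position excluded).
by_position = df.iloc[1:3]

print(by_label)
print(by_position)
```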

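      6. Boolean indexing against a single value (note: the operands being compared must have compatible types; comparing incompatible types raises an error)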

    import pandas as pd
    from pandas import Series,DataFrame
    import numpy as np
    
    def create_numpay_dataform():
        dates = pd.date_range('20130101', periods=6)
        df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
    
        return df
    
    def print_data():
    
        df1 = create_numpay_dataform()
    
        print("-------DF1-----")
        print(df1)
        
        print("\n" + "-------Boolean indexing on the values of one column------")
        print(df1[df1.B > 0])
        print("\n" + "-------Boolean indexing on the whole DataFrame------")
        print(df1[df1 > 0.543791])
    
    
    
    if __name__ == '__main__':
        print_data()
    
    
    # Output
    
    -------DF1-----
                       A         B         C         D
    2013-01-01 -0.571184  0.810240 -1.834513 -0.185410
    2013-01-02 -0.085790 -1.009361  1.311410  0.141120
    2013-01-03  0.672282  0.569641 -1.394152  0.832807
    2013-01-04  0.170832 -0.882142  0.928596 -0.945374
    2013-01-05 -1.100324 -1.045981  1.217005  1.420321
    2013-01-06 -0.952931  0.575549 -0.164552 -1.097455
    
    -------Boolean indexing on the values of one column------
                       A         B         C         D
    2013-01-01 -0.571184  0.810240 -1.834513 -0.185410
    2013-01-03  0.672282  0.569641 -1.394152  0.832807
    2013-01-06 -0.952931  0.575549 -0.164552 -1.097455
    
    -------Boolean indexing on the whole DataFrame------
                       A         B         C         D
    2013-01-01       NaN  0.810240       NaN       NaN
    2013-01-02       NaN       NaN  1.311410       NaN
    2013-01-03  0.672282  0.569641       NaN  0.832807
    2013-01-04       NaN       NaN  0.928596       NaN
    2013-01-05       NaN       NaN  1.217005  1.420321
    2013-01-06       NaN  0.575549       NaN       NaN
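
Several conditions can be combined into one mask. As a minimal sketch on a made-up frame: use `&`, `|` and `~` rather than the Python keywords `and`, `or` and `not`, and wrap each comparison in parentheses.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, -2, 3, -4], "B": [5, 6, -7, -8]})

# Rows where both columns are positive.
both_positive = df[(df["A"] > 0) & (df["B"] > 0)]

# Rows where at least one column is positive.
either_positive = df[(df["A"] > 0) | (df["B"] > 0)]

# Rows where A is not positive.
a_not_positive = df[~(df["A"] > 0)]

print(both_positive)
print(either_positive)
print(a_not_positive)
```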

      7. isin(): boolean test against multiple values

    import pandas as pd
    from pandas import Series,DataFrame
    import numpy as np
    
    def create_numpay_dataform():
        
        df = pd.DataFrame(data={
            'A': list('abaacdadaf'),
            'B': [2, 4, 6, 3, 6, 2, 5, 8, 0, 2],
            'F':  [2, 6, 6, 7, 9, 2, 0, 8, 0, 2],
            'D': list('afvarfxaes'),
    
        })
    
    
        return df
    
    def print_data():
    
        df1 = create_numpay_dataform()
    
        print("-------DF1-----")
        print(df1)
        
    
        print("\n" + "-------Test a single column-------")
        print(df1.F.isin(["a","b",6,9]))
        print("\n" + "-------Test the whole DataFrame-------")
        print(df1.isin(["a", "b", 6, 9]))
    
    if __name__ == '__main__':
        print_data()
    
    # Output
    
    -------DF1-----
       A  B  F  D
    0  a  2  2  a
    1  b  4  6  f
    2  a  6  6  v
    3  a  3  7  a
    4  c  6  9  r
    5  d  2  2  f
    6  a  5  0  x
    7  d  8  8  a
    8  a  0  0  e
    9  f  2  2  s
    
    -------Test a single column-------
    0    False
    1     True
    2     True
    3    False
    4     True
    5    False
    6    False
    7    False
    8    False
    9    False
    Name: F, dtype: bool
    
    -------Test the whole DataFrame-------
           A      B      F      D
    0   True  False  False   True
    1   True  False   True  False
    2   True   True   True  False
    3   True  False  False   True
    4  False   True   True  False
    5  False  False  False  False
    6   True  False  False  False
    7  False  False  False   True
    8   True  False  False  False
    9  False  False  False  False
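
isin() only returns True/False values; to actually filter rows, the resulting mask is passed back into `[ ]`. A short sketch on a made-up frame (`mask`, `selected` and `excluded` are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({"A": list("abaacd"), "B": [2, 4, 6, 3, 6, 2]})

# True where column A holds one of the listed values.
mask = df["A"].isin(["a", "c"])

selected = df[mask]     # rows whose A is "a" or "c"
excluded = df[~mask]    # all remaining rows

print(selected)
print(excluded)
```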

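      8. Assignment: assignment is straightforward. Once label-based or position-based selection has located the target data, assign to it directly. A boolean condition can also be used to choose which values get assigned.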
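
The original gives no code for assignment, so here is a minimal sketch of the three variants it mentions, on a made-up frame; the values are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"A": list("abc"), "B": [2, 4, 6]})

# Assign by label: row label 0, column "B".
df.loc[0, "B"] = 20

# Assign by position: second row, second column.
df.iloc[1, 1] = 40

# Conditional assignment: set A wherever B is greater than 30.
df.loc[df["B"] > 30, "A"] = "big"

print(df)
```

Assigning through a single loc or iloc call, as above, avoids the chained-indexing form `df[...][...] = value`, which may end up writing to a temporary copy.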

      9. Missing values: pandas mainly uses np.nan to represent missing data. By default, missing values are excluded from computations.

    import pandas as pd
    from pandas import Series,DataFrame
    import numpy as np
    
    def create_numpay_dataform():
        
        df = pd.DataFrame(data={
            'A': list('abaacdadaf'),
            'B': [2, 4, 6, 3, 6, 2, 5, 8, 0, 2],
            'F':  [2, 6, 6, 7, 9, 2, 0, 8, 0, 2],
            'D': list('afvarfxaes'),
    
        })
    
    
        return df
    
    def print_data():
    
        df1 = create_numpay_dataform()
    
        print("-------DF1-----")
        print(df1)
        
    
        print("-------Set missing values----------")
        df1.iloc[1:4, :-1] = np.nan
        print(df1)
        print("-------Drop rows with missing data----------")
        # how="all" drops a row only when every value in it is missing; how="any" drops a row as soon as any value is missing
        print(df1.dropna(how="all"))
        print("-------Fill missing data----------")
        print(df1.fillna(value=10086))
        print("-------Test for missing data----------")
        print(pd.isna(df1))
    
    
    
    if __name__ == '__main__':
        print_data()
    
    # Output
    
    -------DF1-----
       A  B  F  D
    0  a  2  2  a
    1  b  4  6  f
    2  a  6  6  v
    3  a  3  7  a
    4  c  6  9  r
    5  d  2  2  f
    6  a  5  0  x
    7  d  8  8  a
    8  a  0  0  e
    9  f  2  2  s
    -------Set missing values----------
         A    B    F  D
    0    a  2.0  2.0  a
    1  NaN  NaN  NaN  f
    2  NaN  NaN  NaN  v
    3  NaN  NaN  NaN  a
    4    c  6.0  9.0  r
    5    d  2.0  2.0  f
    6    a  5.0  0.0  x
    7    d  8.0  8.0  a
    8    a  0.0  0.0  e
    9    f  2.0  2.0  s
    -------Drop rows with missing data----------
         A    B    F  D
    0    a  2.0  2.0  a
    1  NaN  NaN  NaN  f
    2  NaN  NaN  NaN  v
    3  NaN  NaN  NaN  a
    4    c  6.0  9.0  r
    5    d  2.0  2.0  f
    6    a  5.0  0.0  x
    7    d  8.0  8.0  a
    8    a  0.0  0.0  e
    9    f  2.0  2.0  s
    -------Fill missing data----------
           A        B        F  D
    0      a      2.0      2.0  a
    1  10086  10086.0  10086.0  f
    2  10086  10086.0  10086.0  v
    3  10086  10086.0  10086.0  a
    4      c      6.0      9.0  r
    5      d      2.0      2.0  f
    6      a      5.0      0.0  x
    7      d      8.0      8.0  a
    8      a      0.0      0.0  e
    9      f      2.0      2.0  s
    -------Test for missing data----------
           A      B      F      D
    0  False  False  False  False
    1   True   True   True  False
    2   True   True   True  False
    3   True   True   True  False
    4  False  False  False  False
    5  False  False  False  False
    6  False  False  False  False
    7  False  False  False  False
    8  False  False  False  False
    9  False  False  False  False
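
In the output above, dropna(how="all") removes nothing because column D is never missing, so no row is entirely empty. The following sketch, on a small made-up frame, puts how="any" and how="all" side by side.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, np.nan, np.nan], "B": [4.0, 5.0, np.nan]})

# how="any": drop a row as soon as one value in it is missing.
drop_any = df.dropna(how="any")

# how="all": drop a row only when every value in it is missing.
drop_all = df.dropna(how="all")

# fillna replaces every missing value with the given constant.
filled = df.fillna(0)

print(drop_any)
print(drop_all)
print(filled)
```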

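    IV. Statistics

      1. When computing across arrays, the data often has to be aligned first; the shift() method moves the values along an axis for this purpose. (It is not especially efficient.)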

    import pandas as pd
    import numpy as np
    
    def create_numpay_dataform():
        dates = pd.date_range('20130101', periods=6)
        df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
    
       
    
        return df
    
    def print_data():
    
        df1 = create_numpay_dataform()
    
        print("-------DF1-----")
        print(df1)
    
        print("------Shift rows down by one--------")
        print(df1.shift())
    
        print("------Shift columns left by one--------")
        print(df1.shift(-1,axis=1))
    
    
    
    
    
    if __name__ == '__main__':
        print_data()
    
    
    # Output
    
    -------DF1-----
                       A         B         C         D
    2013-01-01 -1.494982 -1.816127 -1.557673  0.676270
    2013-01-02 -0.382565 -0.772728 -2.028113 -1.000548
    2013-01-03  1.024764  1.438836 -2.294408 -0.391837
    2013-01-04  0.460244  1.823243 -0.183927  1.755757
    2013-01-05  0.655894 -0.193546  1.155935 -0.773810
    2013-01-06 -2.142355 -0.583462  1.369368  0.703252
    ------Shift rows down by one--------
                       A         B         C         D
    2013-01-01       NaN       NaN       NaN       NaN
    2013-01-02 -1.494982 -1.816127 -1.557673  0.676270
    2013-01-03 -0.382565 -0.772728 -2.028113 -1.000548
    2013-01-04  1.024764  1.438836 -2.294408 -0.391837
    2013-01-05  0.460244  1.823243 -0.183927  1.755757
    2013-01-06  0.655894 -0.193546  1.155935 -0.773810
    ------Shift columns left by one--------
                       A         B         C   D
    2013-01-01 -1.816127 -1.557673  0.676270 NaN
    2013-01-02 -0.772728 -2.028113 -1.000548 NaN
    2013-01-03  1.438836 -2.294408 -0.391837 NaN
    2013-01-04  1.823243 -0.183927  1.755757 NaN
    2013-01-05 -0.193546  1.155935 -0.773810 NaN
    2013-01-06 -0.583462  1.369368  0.703252 NaN
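
To make the alignment idea concrete: shifting a series by one row places each value next to the value from the previous row, so an ordinary subtraction gives the change between consecutive rows. A minimal sketch with made-up prices (`prices`, `previous` and `change` are illustrative names):

```python
import pandas as pd

prices = pd.Series(
    [10.0, 12.0, 9.0, 15.0],
    index=pd.date_range("20130101", periods=4),
)

# Each date now holds the previous day's value; the first date has none.
previous = prices.shift(1)

# Subtraction aligns on the index, giving the day-over-day change.
change = prices - previous

print(change)
```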

      2. mean(): compute the mean

    import pandas as pd
    from pandas import Series,DataFrame
    import numpy as np
    
    def create_numpay_dataform():
        dates = pd.date_range('20130101', periods=6)
        df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
    
        return df
    
    def print_data():
    
        df1 = create_numpay_dataform()
    
        print("-------DF1-----")
        print(df1)
       
        print("-------设置缺失值----------")
        df1.iloc[1:4, :-1] = np.nan
        print(df1)
    
        print("-------填充缺失数据----------")
        c1 = df1.fillna(value=10086)
        print(c1)
        
    
        print("-------计算无缺失数组的平均值----------")
        #axis = 0 以列计算平均值   1 以行计算平均值,默认为0
        print(c1.mean(axis=1))
        print("-------计算有缺失数组的平均值----------")
        #有缺失数组的平均值去掉缺失数据然后计算
        print(df1.mean())
        
    if __name__ == '__main__':
        print_data()
    
    
    #结果如下
    
    -------DF1-----
                       A         B         C         D
    2013-01-01  0.504033  0.167604  0.656164 -0.305116
    2013-01-02  0.743423  1.004330 -1.858694 -0.962968
    2013-01-03 -0.978681 -0.858943  1.527813  0.442333
    2013-01-04 -0.447715 -1.075530  0.655507  1.271325
    2013-01-05  0.877627  0.641684 -1.701115 -0.211141
    2013-01-06  2.704554 -0.666753 -1.092838 -2.232137
    -------设置缺失值----------
                       A         B         C         D
    2013-01-01  0.504033  0.167604  0.656164 -0.305116
    2013-01-02       NaN       NaN       NaN -0.962968
    2013-01-03       NaN       NaN       NaN  0.442333
    2013-01-04       NaN       NaN       NaN  1.271325
    2013-01-05  0.877627  0.641684 -1.701115 -0.211141
    2013-01-06  2.704554 -0.666753 -1.092838 -2.232137
    -------填充缺失数据----------
                           A             B             C         D
    2013-01-01      0.504033      0.167604      0.656164 -0.305116
    2013-01-02  10086.000000  10086.000000  10086.000000 -0.962968
    2013-01-03  10086.000000  10086.000000  10086.000000  0.442333
    2013-01-04  10086.000000  10086.000000  10086.000000  1.271325
    2013-01-05      0.877627      0.641684     -1.701115 -0.211141
    2013-01-06      2.704554     -0.666753     -1.092838 -2.232137
    -------计算无缺失数组的平均值----------
    A    5043.681036
    B    5043.023756
    C    5042.643702
    D      -0.332951
    dtype: float64
    -------Column means of the data with missing values----------
    A    1.362071
    B    0.047512
    C   -0.712596
    D   -0.332951
    dtype: float64
    
    
    Process finished with exit code 0
    mean()
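The effect of axis and of missing values on mean() is easier to check on a small hand-made frame than on random data. The following is a minimal sketch with made-up values; the variable names are illustrative only, and skipna is the DataFrame.mean parameter that controls whether NaN is ignored (it defaults to True).

```python
import numpy as np
import pandas as pd

# Small illustrative frame with one missing value (values are made up).
df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [4.0, 5.0, 6.0]})

col_mean = df.mean()                 # axis=0: one mean per column, NaN skipped
row_mean = df.mean(axis=1)           # axis=1: one mean per row, NaN skipped
strict_mean = df.mean(skipna=False)  # NaN propagates instead of being skipped

print(col_mean)
print(row_mean)
print(strict_mean)
```

With skipna=False, any column that contains a NaN reports NaN as its mean, which can be useful when a silent partial average would be misleading.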

      3. diff(): compute the difference between rows (or columns) that are a given number of periods apart

    import pandas as pd
    import numpy as np
    
    def create_numpay_dataform():
        dates = pd.date_range('20130101', periods=6)
        df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
    
    
    
        return df
    
    def print_data():
    
        df1 = create_numpay_dataform()
    
        print("-------DF1-----")
        print(df1)
    
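        print("------Row differences with periods=-1 (each row minus the next row)--------")
        print(df1.diff(-1))
    
        print("------Column differences with periods=2 (each column minus the column two to its left)--------")
        print(df1.diff(2,axis=1))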
    
    
    
    
    
    if __name__ == '__main__':
        print_data()
    
    
    # Output:
    
    -------DF1-----
                       A         B         C         D
    2013-01-01 -1.482337  0.735672 -0.523935  1.441714
    2013-01-02 -0.293590 -1.251721 -0.532770 -0.178270
    2013-01-03  0.464124  0.148478  0.647906 -0.462180
    2013-01-04 -1.313573 -0.280773 -0.815059  0.449937
    2013-01-05 -0.042054  0.037449 -1.380082  1.694301
    2013-01-06 -0.685625  0.379272 -0.009392 -0.563834
    ------Row differences with periods=-1 (each row minus the next row)--------
                       A         B         C         D
    2013-01-01 -1.188747  1.987393  0.008835  1.619984
    2013-01-02 -0.757714 -1.400199 -1.180676  0.283910
    2013-01-03  1.777696  0.429250  1.462965 -0.912117
    2013-01-04 -1.271519 -0.318222  0.565023 -1.244364
    2013-01-05  0.643571 -0.341823 -1.370691  2.258135
    2013-01-06       NaN       NaN       NaN       NaN
    ------Column differences with periods=2 (each column minus the column two to its left)--------
                 A   B         C         D
    2013-01-01 NaN NaN  0.958402  0.706042
    2013-01-02 NaN NaN -0.239180  1.073451
    2013-01-03 NaN NaN  0.183783 -0.610658
    2013-01-04 NaN NaN  0.498513  0.730710
    2013-01-05 NaN NaN -1.338028  1.656852
    2013-01-06 NaN NaN  0.676234 -0.943107
    
    Process finished with exit code 0
    diff()
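The sign of periods decides which neighbor is subtracted, which is hard to see with random numbers. Here is a small sketch on integer data chosen for illustration (the frame and variable names are not from the original post).

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 4, 9], "B": [10, 20, 40], "C": [100, 50, 0]})

down = df.diff()              # periods=1: row i minus row i-1, first row is NaN
up = df.diff(-1)              # periods=-1: row i minus row i+1, last row is NaN
across = df.diff(2, axis=1)   # column j minus column j-2, first two columns are NaN

print(down)
print(up)
print(across)
```

In other words, a positive periods looks backward and a negative periods looks forward, and the rows or columns without a partner are filled with NaN.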

      4. sub(): subtract a DataFrame or a Series from a DataFrame

    import pandas as pd
    import numpy as np
    
    def create_numpay_dataform():
        dates = pd.date_range('20130101', periods=6)
        df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
        df1 = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
        s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates)
        return df,df1,s
    
    def print_data():
    
        df,df1,s1 = create_numpay_dataform()
    
        print("-------DF-----")
        print(df)
        print("-------DF1--------")
        print(df1)
        print("-------s1--------")
        print(s1)
    
    
        print("------DataFrame minus Series--------")
        print(df.sub(s1,axis="index"))
    
        print("------DataFrame minus DataFrame--------")
        print(df.sub(df1))
    
    
    
    
    
    if __name__ == '__main__':
        print_data()
    
    
    # Output:
    
    -------DF-----
                       A         B         C         D
    2013-01-01 -1.117915 -0.387520  0.013181 -0.732305
    2013-01-02  0.259804  0.943158  0.209316 -0.179862
    2013-01-03 -0.681971 -1.385040  0.354760 -0.572621
    2013-01-04 -0.019748 -0.703220 -0.765874 -0.584478
    2013-01-05  1.187278 -0.287918 -0.215136  0.075496
    2013-01-06 -1.160146 -0.882323 -0.620577  0.380190
    -------DF1--------
                       A         B         C         D
    2013-01-01  1.532277 -0.527844 -0.345524  0.701999
    2013-01-02  0.794895 -2.042780  1.163952 -0.877180
    2013-01-03 -0.489494 -0.131753  0.444089  0.789567
    2013-01-04  0.440047 -0.693099 -0.243348 -0.612980
    2013-01-05 -1.128350 -1.012848  0.632883 -0.023234
    2013-01-06 -0.672428 -0.249193  1.676576 -1.486626
    -------s1--------
    2013-01-01    1.0
    2013-01-02    3.0
    2013-01-03    5.0
    2013-01-04    NaN
    2013-01-05    6.0
    2013-01-06    8.0
    Freq: D, dtype: float64
    ------DataFrame minus Series--------
                       A         B         C         D
    2013-01-01 -2.117915 -1.387520 -0.986819 -1.732305
    2013-01-02 -2.740196 -2.056842 -2.790684 -3.179862
    2013-01-03 -5.681971 -6.385040 -4.645240 -5.572621
    2013-01-04       NaN       NaN       NaN       NaN
    2013-01-05 -4.812722 -6.287918 -6.215136 -5.924504
    2013-01-06 -9.160146 -8.882323 -8.620577 -7.619810
    ------DataFrame minus DataFrame--------
                       A         B         C         D
    2013-01-01 -2.650191  0.140324  0.358705 -1.434305
    2013-01-02 -0.535091  2.985938 -0.954636  0.697318
    2013-01-03 -0.192477 -1.253287 -0.089329 -1.362188
    2013-01-04 -0.459795 -0.010121 -0.522526  0.028502
    2013-01-05  2.315628  0.724930 -0.848019  0.098730
    2013-01-06 -0.487718 -0.633130 -2.297154  1.866815
    
    Process finished with exit code 0
    sub()
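As the NaN row in the output above shows, a missing value in either operand makes the result missing. A minimal sketch of how alignment and the fill_value parameter of sub() behave, using made-up labels and values (the frames here are illustrative assumptions, not the data above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [10.0, 20.0, 30.0]},
                  index=["x", "y", "z"])
s = pd.Series([1.0, np.nan, 3.0], index=["x", "y", "z"])

# axis="index" matches the Series labels against the row labels of df.
by_row = df.sub(s, axis="index")

# `other` has a NaN and no column B; fill_value=0 treats both gaps as 0.
other = pd.DataFrame({"A": [1.0, np.nan, 1.0]}, index=["x", "y", "z"])
filled = df.sub(other, fill_value=0)

print(by_row)
print(filled)
```

Without fill_value, column B of the second result would be entirely NaN because other has no column with that label.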

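      5. apply(): one of the most flexible methods in pandas. The signature quoted in the original post is apply(func, axis=0, broadcast=False, raw=False, reduce=None, args=(), **kwds); the broadcast and reduce parameters were deprecated in pandas 0.23 and removed in pandas 1.0 in favor of result_type.

       a. The most important argument is the first one, func: a function that you supply, either your own or an existing one such as np.cumsum.

       b. What func receives depends on axis. With axis=1, each row is passed to func as a Series, and inside func you can compute with the values of its different columns. With the default axis=0, each column is passed instead.

       c. apply() calls func once for every row (or column) of the DataFrame and combines the results. If func returns a scalar, the combined result is a Series; if it returns a Series, the combined result is a DataFrame.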

    import pandas as pd
    import numpy as np
    
    def create_numpay_dataform():
        dates = pd.date_range('20130101', periods=6)
        df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
        
        return df
    
    def print_data():
    
        df = create_numpay_dataform()
    
        print("-------DF-----")
        print(df)
        
    
    
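        print("------apply with an existing function--------")
        # np.cumsum computes the cumulative sum down each column
        print(df.apply(np.cumsum))
    
        print("------apply with a custom function--------")
        # For each column: the maximum minus the value in the third row (position 2)
        print(df.apply(lambda x : x.max()-x.iloc[2]))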
    
    
    
    if __name__ == '__main__':
        print_data()
    
    # Output:
    
    -------DF-----
                       A         B         C         D
    2013-01-01 -2.086164  0.201652  1.722858  1.071210
    2013-01-02 -0.462887 -0.188189  0.733832 -1.798445
    2013-01-03  0.672316 -1.359191 -0.031073  0.508793
    2013-01-04 -0.624844 -0.503734  0.262923 -0.519521
    2013-01-05 -1.170108 -0.308858 -0.653888  0.552537
    2013-01-06 -1.408287  0.406629  0.000608  0.085242
    ------apply with an existing function--------
                       A         B         C         D
    2013-01-01 -2.086164  0.201652  1.722858  1.071210
    2013-01-02 -2.549051  0.013464  2.456690 -0.727234
    2013-01-03 -1.876735 -1.345727  2.425617 -0.218442
    2013-01-04 -2.501579 -1.849461  2.688541 -0.737963
    2013-01-05 -3.671687 -2.158320  2.034652 -0.185426
    2013-01-06 -5.079974 -1.751691  2.035260 -0.100184
    ------apply with a custom function--------
    A    0.000000
    B    1.765820
    C    1.753931
    D    0.562418
    dtype: float64
    
    Process finished with exit code 0
    apply()

      

    V. Merging

      1. concat(): concatenate multiple DataFrames.

    import pandas as pd
    import numpy as np
    
    def create_numpay_dataform():
        dates = pd.date_range('20130101', periods=9)
        df = pd.DataFrame(np.random.randn(9, 4), index=dates, columns=list('ABCD'))
    
        return df
    
    def print_data():
    
        df = create_numpay_dataform()
    
        print("-------DF-----")
        print(df)
    
        print("-------Slices of DF-------")
        pieces = [df[:3], df[3:7], df[7:]]
        print(pieces[0])
        print(pieces[1])
        print(pieces[2])
    
        print("-------Concatenate the slices (in a different order)-------")
        print(pd.concat([pieces[0],pieces[2],pieces[1]]))
    
    
    
    
    if __name__ == '__main__':
        print_data()
    
    
    # Output:
    
    -------DF-----
                       A         B         C         D
    2013-01-01  0.276946  1.235298  0.932776 -0.565113
    2013-01-02 -0.503525  0.365262  0.884855 -1.432992
    2013-01-03 -0.042289  0.923140 -0.067742 -0.993290
    2013-01-04 -0.560989 -0.433529 -0.339409  0.099952
    2013-01-05  0.032306  0.003271  0.605058  0.398746
    2013-01-06  0.033632 -1.831336  0.828554 -0.745181
    2013-01-07 -0.306900  0.027087  0.387204 -1.099752
    2013-01-08  0.580035 -0.305193 -0.287659 -1.204415
    2013-01-09  1.077574  1.034927 -0.360812 -0.792874
    -------Slices of DF-------
                       A         B         C         D
    2013-01-01  0.276946  1.235298  0.932776 -0.565113
    2013-01-02 -0.503525  0.365262  0.884855 -1.432992
    2013-01-03 -0.042289  0.923140 -0.067742 -0.993290
                       A         B         C         D
    2013-01-04 -0.560989 -0.433529 -0.339409  0.099952
    2013-01-05  0.032306  0.003271  0.605058  0.398746
    2013-01-06  0.033632 -1.831336  0.828554 -0.745181
    2013-01-07 -0.306900  0.027087  0.387204 -1.099752
                       A         B         C         D
    2013-01-08  0.580035 -0.305193 -0.287659 -1.204415
    2013-01-09  1.077574  1.034927 -0.360812 -0.792874
    -------Concatenate the slices (in a different order)-------
                       A         B         C         D
    2013-01-01  0.276946  1.235298  0.932776 -0.565113
    2013-01-02 -0.503525  0.365262  0.884855 -1.432992
    2013-01-03 -0.042289  0.923140 -0.067742 -0.993290
    2013-01-08  0.580035 -0.305193 -0.287659 -1.204415
    2013-01-09  1.077574  1.034927 -0.360812 -0.792874
    2013-01-04 -0.560989 -0.433529 -0.339409  0.099952
    2013-01-05  0.032306  0.003271  0.605058  0.398746
    2013-01-06  0.033632 -1.831336  0.828554 -0.745181
    2013-01-07 -0.306900  0.027087  0.387204 -1.099752
    
    Process finished with exit code 0
    concat()
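concat() keeps the original index labels by default, as the reordered dates above show. A short sketch of a few commonly used options, on two small made-up frames (the names top and bottom are illustrative):

```python
import pandas as pd

top = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
bottom = pd.DataFrame({"A": [5, 6], "B": [7, 8]})

stacked = pd.concat([top, bottom])                           # labels repeat: 0, 1, 0, 1
renumbered = pd.concat([top, bottom], ignore_index=True)     # fresh labels: 0, 1, 2, 3
labelled = pd.concat([top, bottom], keys=["top", "bottom"])  # outer index level names the source
side_by_side = pd.concat([top, bottom], axis=1)              # join columns, aligned on the index

print(stacked)
print(renumbered)
print(labelled)
print(side_by_side)
```

The keys option is handy when you later need to know which input each row came from, for example labelled.loc["bottom"].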

      2. merge(): SQL-style merging, similar to a database join.

    import pandas as pd
    import numpy as np
    
    def create_numpay_dataform():
        
        df1 = pd.DataFrame({"rng":["xiaohua","ming","uzi"],"age":[22,19,24]})
        df2 = pd.DataFrame({"rng": ["xiaohua", "ming", "uzi"], "role": ["mid", "sup", "adc"]})
        df3 = pd.DataFrame({"team":["rng","rng","rng"],"name":["xiaohua","ming","uzi"]})
        df4 = pd.DataFrame({"team":["rng","rng"],"opponent":["LGD","IG"]})
        return df1,df2,df3,df4
    
    def print_data():
    
        df1,df2,df3,df4 = create_numpay_dataform()
    
        print("-------DF1-----")
        print(df1)
    
        print("-------DF2-----")
        print(df2)
    
        print("-------DF3-----")
        print(df3)
    
        print("-------DF4-----")
        print(df4)
    
        print("-------Merge on a key whose values are unique-------")
        print(pd.merge(df1,df2,on="rng"))
    
        print("-------Merge on a key whose values repeat (every matching pair is produced)-------")
        print(pd.merge(df3,df4,on="team"))
    
    
    if __name__ == '__main__':
        print_data()
    
    # Output:
    
    -------DF1-----
           rng  age
    0  xiaohua   22
    1     ming   19
    2      uzi   24
    -------DF2-----
           rng role
    0  xiaohua  mid
    1     ming  sup
    2      uzi  adc
    -------DF3-----
      team     name
    0  rng  xiaohua
    1  rng     ming
    2  rng      uzi
    -------DF4-----
      team opponent
    0  rng      LGD
    1  rng       IG
    -------Merge on a key whose values are unique-------
           rng  age role
    0  xiaohua   22  mid
    1     ming   19  sup
    2      uzi   24  adc
    -------Merge on a key whose values repeat (every matching pair is produced)-------
      team     name opponent
    0  rng  xiaohua      LGD
    1  rng  xiaohua       IG
    2  rng     ming      LGD
    3  rng     ming       IG
    4  rng      uzi      LGD
    5  rng      uzi       IG
    
    Process finished with exit code 0
    merge()
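Both merges above use the default how="inner", so only keys present in both frames survive. The following sketch shows the other join types on two small frames whose keys only partly overlap; the extra player "letme" and the column name "name" are made up for illustration.

```python
import pandas as pd

players = pd.DataFrame({"name": ["xiaohua", "ming", "uzi"], "age": [22, 19, 24]})
roles = pd.DataFrame({"name": ["ming", "uzi", "letme"], "role": ["sup", "adc", "top"]})

inner = pd.merge(players, roles, on="name")               # only names found in both frames
left = pd.merge(players, roles, on="name", how="left")    # every player; role is NaN if unknown
outer = pd.merge(players, roles, on="name", how="outer",
                 indicator=True)                          # all names, plus a _merge column

print(inner)
print(left)
print(outer)
```

The _merge column added by indicator=True reports whether each row came from the left frame only, the right frame only, or both, which helps when checking why rows went missing in a join.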

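      3. append(): append rows to the end of a DataFrame. DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0; pd.concat() is the replacement.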

    import pandas as pd
    import numpy as np
    
    def create_numpay_dataform():
        dates = pd.date_range('20130101', periods=9)
        df = pd.DataFrame(np.random.randn(9, 4), index=dates, columns=list('ABCD'))
    
    
        return df
    
    def print_data():
    
        df = create_numpay_dataform()
    
        print("-------DF-----")
        print(df)
    
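        print("-------Select the first N rows of DF-----")
        s = df.iloc[0:2]
        print(s)
    
        print("-------Append the rows to the end of DF-------")
        # DataFrame.append() was removed in pandas 2.0; pd.concat() gives the same result.
        # ignore_index=True discards the original index labels and renumbers the rows from 0.
        print(pd.concat([df, s], ignore_index=True))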
    
    
    if __name__ == '__main__':
        print_data()
    
    
    # Output:
    
    -------DF-----
                       A         B         C         D
    2013-01-01  0.180073  1.027674  0.699021  0.211052
    2013-01-02  0.700873  0.893067  0.234802  1.378712
    2013-01-03 -0.318609 -0.291524  0.123771  1.057293
    2013-01-04 -0.145169  0.213432  0.285161 -0.231468
    2013-01-05 -0.916774 -1.284495  1.661716 -0.258821
    2013-01-06  0.460373 -2.351527 -0.462772 -0.587480
    2013-01-07 -1.149013 -1.290900  0.171418 -0.076885
    2013-01-08 -1.621095  0.704023 -0.706554  0.016696
    2013-01-09 -0.405135 -1.019510  0.863830 -1.316628
    -------Select the first N rows of DF-----
                       A         B         C         D
    2013-01-01  0.180073  1.027674  0.699021  0.211052
    2013-01-02  0.700873  0.893067  0.234802  1.378712
    -------Append the rows to the end of DF-------
               A         B         C         D
    0   0.180073  1.027674  0.699021  0.211052
    1   0.700873  0.893067  0.234802  1.378712
    2  -0.318609 -0.291524  0.123771  1.057293
    3  -0.145169  0.213432  0.285161 -0.231468
    4  -0.916774 -1.284495  1.661716 -0.258821
    5   0.460373 -2.351527 -0.462772 -0.587480
    6  -1.149013 -1.290900  0.171418 -0.076885
    7  -1.621095  0.704023 -0.706554  0.016696
    8  -0.405135 -1.019510  0.863830 -1.316628
    9   0.180073  1.027674  0.699021  0.211052
    10  0.700873  0.893067  0.234802  1.378712
    
    Process finished with exit code 0
    append()
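Because DataFrame.append() no longer exists in pandas 2.0 and later, adding a single new row is usually done with pd.concat() as well. A minimal sketch, assuming the new row arrives as a dict (the values are made up):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

# Wrap the dict in a one-row DataFrame so that concat can align it by column name.
new_row = pd.DataFrame([{"A": 9, "B": 10}])

extended = pd.concat([df, new_row], ignore_index=True)
print(extended)
```

Like the old append(), concat() returns a new object and leaves df unchanged. When many rows are added in a loop, collecting them in a list and calling concat() once is generally cheaper than concatenating on every iteration.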

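    VI. Grouping

      group by: a process that involves one or more of the following steps:

        a. Split: divide the data into groups according to some criteria

        b. Apply: apply a function to each group independently

        c. Combine: combine the results into a single data structure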
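The three steps can be spelled out by hand to see what groupby() does internally in spirit. This is only an illustrative sketch on made-up data (the column names key and val are assumptions), not how pandas actually implements it.

```python
import pandas as pd

df = pd.DataFrame({"key": ["foo", "bar", "foo", "bar"], "val": [1, 2, 3, 4]})

# Split: iterating over a GroupBy object yields (group label, sub-frame) pairs.
pieces = {label: group for label, group in df.groupby("key")}

# Apply: run a function on each piece separately.
sums = {label: group["val"].sum() for label, group in pieces.items()}

# Combine: collect the per-group results into one Series.
manual = pd.Series(sums, name="val")

# The usual one-liner performs all three steps at once.
shortcut = df.groupby("key")["val"].sum()

print(manual)
print(shortcut)
```

Both results hold the same per-group sums; the one-liner additionally names the index after the grouping column.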

    import pandas as pd
    import numpy as np
    
    def create_numpay_dataform():
        df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'],
                           'B': ['one', 'one', 'two', 'three','two', 'two', 'one', 'three'],
                           'C': np.random.randn(8),
                           'D': np.random.randn(8)})
    
        return df
    
    def print_data():
    
        df = create_numpay_dataform()
    
        print("-------DF-----")
        print(df)
    
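        print("-------Group by one column, then sum-----")
        # numeric_only=True sums only the numeric columns C and D; pandas 2.0+ would otherwise also concatenate the strings in B
        print(df.groupby('A').sum(numeric_only=True))
    
        print("-------Group by several columns, then sum-------")
        print(df.groupby(["B","A"]).sum())
        print("-------Same grouping with the key order swapped-------")
        print(df.groupby(["A", "B"]).sum())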
    
    
    if __name__ == '__main__':
        print_data()
    
    
    # Output:
    
    -------DF-----
         A      B         C         D
    0  foo    one  0.453100 -0.544181
    1  bar    one  1.692183 -0.253889
    2  foo    two -0.656308 -1.177487
    3  bar  three -1.078701  1.239209
    4  foo    two -0.866770 -0.949062
    5  bar    two -1.305346 -1.705380
    6  foo    one -0.259537 -1.492884
    7  foo  three -0.669982 -0.943082
    -------Group by one column, then sum-----
                C         D
    A                      
    bar -0.691863 -0.720059
    foo -1.999496 -5.106695
    -------Group by several columns, then sum-------
                      C         D
    B     A                      
    one   bar  1.692183 -0.253889
          foo  0.193563 -2.037065
    three bar -1.078701  1.239209
          foo -0.669982 -0.943082
    two   bar -1.305346 -1.705380
          foo -1.523078 -2.126549
    -------Same grouping with the key order swapped-------
                      C         D
    A   B                        
    bar one    1.692183 -0.253889
        three -1.078701  1.239209
        two   -1.305346 -1.705380
    foo one    0.193563 -2.037065
        three -0.669982 -0.943082
        two   -1.523078 -2.126549
    
    Process finished with exit code 0
    groupby()
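sum() is only one possible "apply" step. As a hedged sketch, the agg() method can compute several statistics per group in one pass, and as_index=False keeps the group keys as ordinary columns instead of an index. The data and the output column names total, average and rows are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": ["foo", "bar", "foo", "bar"],
                   "B": ["one", "one", "two", "two"],
                   "C": [1.0, 2.0, 3.0, 4.0]})

# Named aggregation: output_column=(input_column, function name).
summary = df.groupby("A").agg(total=("C", "sum"),
                              average=("C", "mean"),
                              rows=("C", "size"))

# as_index=False returns the keys A and B as regular columns.
flat = df.groupby(["A", "B"], as_index=False)["C"].sum()

print(summary)
print(flat)
```

The flat form is often more convenient when the grouped result is going to be merged with another frame or written to a file.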

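    VII. Pivot tables

      What is a pivot table?

        A pivot table is an interactive table in which you can freely combine fields to quickly summarize large amounts of data and analyze how the fields relate to one another. It lets you view a data table from several angles by pivoting on different fields and building cross-tabulations, which show summaries, analysis results and digests at different levels of the data.

      What are the advantages of a pivot table?

      • Quickly aggregate numeric data by category and subcategory.
      • Expand or collapse the data of interest to move between summaries and their underlying details.
      • Build cross-tabulations (move rows to columns or columns to rows) to see different summaries of the data.
      • Quickly compute totals, differences and other summary statistics for numeric data.

      pivot_table(): usage pivot_table(data, values=None, index=None, columns=None,aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All')

        The four most important parameters: index, values, columns, aggfunc

          a. index: the column or columns whose values become the row index of the pivot table. It can have one level or several levels.

          b. values: the column or columns to aggregate

          c. columns: like index, but it sets the column levels of the table. It is optional and gives a second way to split the data.

          d. aggfunc: the aggregation applied to each group of values. It can be a function, the name of a function as a string (such as 'sum'), or a list of these; the default is 'mean'.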
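A small sketch of these parameters working together, on made-up sales data (the column names region, product and sales are assumptions). It also uses fill_value and margins from the signature above: fill_value replaces empty cells, and margins=True adds an "All" row and column with the overall aggregates.

```python
import pandas as pd

df = pd.DataFrame({"region": ["north", "north", "south", "south", "north"],
                   "product": ["pen", "ink", "pen", "pen", "pen"],
                   "sales": [10, 20, 30, 40, 50]})

table = pd.pivot_table(df,
                       values="sales",     # what to aggregate
                       index="region",     # row keys
                       columns="product",  # column keys
                       aggfunc="sum",      # how to aggregate
                       fill_value=0,       # no "ink" sales in the south -> 0 instead of NaN
                       margins=True)       # add the "All" totals

print(table)
```

Here the north row sums two pen sales (10 and 50), and the cell for south and ink is filled with 0 because that combination never occurs in the data.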

    import pandas as pd
    import numpy as np
    
    def create_numpay_dataform():
    
        df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 3,
                           'B': ['A', 'B', 'C'] * 4,
                           'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                           'D': np.random.randn(12),
                           'E': np.random.randn(12)})
    
        return df
    
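    # Example custom aggregation function (not used below): it reports whether
    # the first value it receives is positive, since both branches return immediately.
    def table_func(data):
        for i in data:
            if i > 0:
                return True
            else:
                return False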
    
    def print_data():
    
        df = create_numpay_dataform()
    
        print("-------DF-----")
        print(df)
    
        print("-------Pivot table-----")
        print(pd.pivot_table(df, values='E', index=['A', 'B'],columns=["C"]))
        print("--------------------")
        # Function names as strings avoid the FutureWarning that recent pandas versions emit for np.sum / np.mean
        print(pd.pivot_table(df, values='E', index=['A', 'B'],columns=["C"], aggfunc=['sum','mean']))
    
    
    if __name__ == '__main__':
        print_data()
    
    
    # Output:
    
    -------DF-----
            A  B    C         D         E
    0     one  A  foo  0.596744  1.260272
    1     one  B  foo -0.560929  2.077597
    2     two  C  foo -1.326983 -0.997230
    3   three  A  bar  0.714451  0.520551
    4     one  B  bar  2.378704  0.336855
    5     one  C  bar -0.771644  0.109514
    6     two  A  foo -2.606868 -0.279142
    7   three  B  foo -0.775949 -1.383773
    8     one  C  foo  0.106014 -0.840803
    9     one  A  bar -0.877053  0.090785
    10    two  B  bar -1.594153 -1.002086
    11  three  C  bar -0.032272 -0.700847
    -------Pivot table-----
    C             bar       foo
    A     B                    
    one   A  0.090785  1.260272
          B  0.336855  2.077597
          C  0.109514 -0.840803
    three A  0.520551       NaN
          B       NaN -1.383773
          C -0.700847       NaN
    two   A       NaN -0.279142
          B -1.002086       NaN
          C       NaN -0.997230
                  sum                mean          
    C             bar       foo       bar       foo
    A     B                                        
    one   A  0.090785  1.260272  0.090785  1.260272
          B  0.336855  2.077597  0.336855  2.077597
          C  0.109514 -0.840803  0.109514 -0.840803
    three A  0.520551       NaN  0.520551       NaN
          B       NaN -1.383773       NaN -1.383773
          C -0.700847       NaN -0.700847       NaN
    two   A       NaN -0.279142       NaN -0.279142
          B -1.002086       NaN -1.002086       NaN
          C       NaN -0.997230       NaN -0.997230
    
    Process finished with exit code 0
    pivot_table()
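A pivot table is closely related to the grouping section above: grouping by the index and columns keys, aggregating, and then moving one key from the rows to the columns gives the same numbers. A minimal sketch on made-up data, assuming a single values column and a single aggregation:

```python
import pandas as pd

df = pd.DataFrame({"A": ["one", "one", "two", "two"],
                   "C": ["foo", "bar", "foo", "foo"],
                   "E": [1.0, 2.0, 3.0, 5.0]})

pivoted = pd.pivot_table(df, values="E", index="A", columns="C", aggfunc="mean")

# Same result built by hand: group, aggregate, then unstack C into the columns.
grouped = df.groupby(["A", "C"])["E"].mean().unstack("C")

print(pivoted)
print(grouped)
```

Combinations that never occur in the data, such as ("two", "bar") here, appear as NaN in both results.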

      

  • Original article: https://www.cnblogs.com/ppzhang/p/13770584.html