  • pandas Basics


    Introduction to pandas

    Python Data Analysis Library

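    pandas is a tool built on NumPy, created to handle data analysis tasks. It incorporates a large number of libraries and some standard data models, and provides the tools needed to work efficiently with large structured datasets.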

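    Core pandas data structures

    A data structure is the way a computer stores and organizes data. A carefully chosen data structure usually yields better runtime or storage efficiency, and data structures are often closely tied to efficient retrieval algorithms and indexing techniques.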

    Series

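    A Series can be thought of as a one-dimensional array whose index labels you can set yourself. It resembles a fixed-length ordered dictionary, with an index and values.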

    """
    The pandas Series object
    """
    import pandas as pd
    import numpy as np
    
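    # Empty Series object (pass dtype explicitly; recent pandas versions no longer default to float64)
    s1 = pd.Series(dtype='float64')
    print(s1)  # Series([], dtype: float64)
    # Create a Series from an array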
    data = np.array(['zs', 'ls', 'ww', 'zl'])
    s2 = pd.Series(data)
    print(s2)
    """
    0    zs
    1    ls
    2    ww
    3    zl
    dtype: object
    """
    
    # Specify custom index labels
    s3 = pd.Series(data, index=['s001', 's002', 's003', 's004'])
    print(s3)
    """
    s001    zs
    s002    ls
    s003    ww
    s004    zl
    dtype: object
    """
    
    # Create a Series from a dictionary
    data = {'s01': 'zs', 's02': 'li', 's03': 'ww', 's04': 'zl'}
    s4 = pd.Series(data)
    print(s4)
    """
    s01    zs
    s02    li
    s03    ww
    s04    zl
    dtype: object
    """
    
    # Create a Series from a scalar
    s5 = pd.Series(5,index=['a','b','c'])
    print(s5)
    """
    a    5
    b    5
    c    5
    dtype: int64
    """
    
    # Read data from a Series
    print(s3)
    """
    s001    zs
    s002    ls
    s003    ww
    s004    zl
    dtype: object
    """
    print(s3.iloc[0])  # zs, access by position
    print(s3.iloc[:2])  # access by positional slice
    """
    s001    zs
    s002    ls
    dtype: object
    """
    print(s3['s003'])  # ww, access by index label
    print(s3[['s001', 's003']])  # access by a list of index labels
    """
    s001    zs
    s003    ww
    dtype: object
    """
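
    The two access styles used above, by position and by label, can be made explicit with .iloc and .loc. The following is a minimal sketch; the Series s simply mirrors s3, and the variable names are illustrative assumptions.

```python
import pandas as pd

# Hypothetical data mirroring the s3 Series from the example above
s = pd.Series(['zs', 'ls', 'ww', 'zl'],
              index=['s001', 's002', 's003', 's004'])

first = s.iloc[0]                # select by integer position
by_label = s.loc['s003']         # select by index label
head_pos = s.iloc[:2]            # positional slice: the stop position is excluded
head_lab = s.loc['s001':'s002']  # label slice: the stop label is included

print(first, by_label)
print(head_pos.equals(head_lab))
```

    Note that a positional slice excludes its end point while a label slice includes it, so both slices here select the same two rows.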

    Date handling in pandas

    import pandas as pd
    
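    # Date string formats that pandas recognizes
    s6 = pd.Series(['2011', '2011-01',
               '2011-01-02',
               '2012/02/01',
               '2011-01-02 08:00:00',
               '01 Jun 2012'])
    # to_datetime() converts values to the datetime type.
    # The strings use different formats, so each one is parsed on its own here;
    # pandas 2.0+ can also parse them in one call with pd.to_datetime(s6, format='mixed').
    s6 = s6.apply(pd.to_datetime)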
    print(s6)
    """
    0   2011-01-01 00:00:00
    1   2011-01-01 00:00:00
    2   2011-01-02 00:00:00
    3   2012-02-01 00:00:00
    4   2011-01-02 08:00:00
    5   2012-06-01 00:00:00
    dtype: datetime64[ns]
    """
    # datetime data supports date arithmetic
    delta = s6-pd.to_datetime('2011-01-01')
    
    print(delta)
    """
    0     0 days 00:00:00
    1     0 days 00:00:00
    2     1 days 00:00:00
    3   396 days 00:00:00
    4     1 days 08:00:00
    5   517 days 00:00:00
    dtype: timedelta64[ns]
    """
    # Print one field of the dates in s6
    print(s6.dt.quarter)
    """
    0    1
    1    1
    2    1
    3    1
    4    1
    5    2
    dtype: int64
    """
    # Get the offset in days
    print(delta.dt.days)
    """
    0      0
    1      0
    2      1
    3    396
    4      1
    5    517
    """
    print(s6.dt.month)
    """
    0    1
    1    1
    2    1
    3    2
    4    1
    5    6
    dtype: int64
    """

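    When every string is known to share one format, the format can be stated explicitly, and values that fail to parse can be turned into NaT instead of raising an error. The sketch below assumes made-up input data; format and errors are existing parameters of pd.to_datetime.

```python
import pandas as pd

# Hypothetical input: two valid dates and one invalid entry
raw = pd.Series(['2011-01-02', '2012-02-01', 'not a date'])

# format states the expected layout; errors='coerce' maps failures to NaT
parsed = pd.to_datetime(raw, format='%Y-%m-%d', errors='coerce')
print(parsed)

# Count the entries that could not be parsed
n_bad = int(parsed.isna().sum())
print(n_bad)
```
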
    Series.dt provides many date-related operations, including:

    Series.dt.year    The year of the datetime.
    Series.dt.month    The month as January=1, December=12.
    Series.dt.day    The days of the datetime.
    Series.dt.hour    The hours of the datetime.
    Series.dt.minute    The minutes of the datetime.
    Series.dt.second    The seconds of the datetime.
    Series.dt.microsecond    The microseconds of the datetime.
    Series.dt.isocalendar().week    The ISO week number of the year (replaces Series.dt.week and Series.dt.weekofyear, which were removed in pandas 2.0).
    Series.dt.dayofweek    The day of the week with Monday=0, Sunday=6.
    Series.dt.weekday    The day of the week with Monday=0, Sunday=6.
    Series.dt.dayofyear    The ordinal day of the year.
    Series.dt.quarter    The quarter of the date.
    Series.dt.is_month_start    Indicates whether the date is the first day of the month.
    Series.dt.is_month_end    Indicates whether the date is the last day of the month.
    Series.dt.is_quarter_start    Indicator for whether the date is the first day of a quarter.
    Series.dt.is_quarter_end    Indicator for whether the date is the last day of a quarter.
    Series.dt.is_year_start    Indicate whether the date is the first day of a year.
    Series.dt.is_year_end    Indicate whether the date is the last day of the year.
    Series.dt.is_leap_year    Boolean indicator if the date belongs to a leap year.
    Series.dt.days_in_month    The number of days in the month.
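
    A few of the attributes listed above can be tried on a small Series. This is only a sketch with hypothetical dates chosen to cover a year boundary and a leap day.

```python
import pandas as pd

# Hypothetical dates: a Monday at year end, a leap day, and a quarter end
dates = pd.Series(pd.to_datetime(['2019-12-30', '2020-02-29', '2020-03-31']))

years = dates.dt.year.tolist()
weekdays = dates.dt.dayofweek.tolist()       # Monday=0, Sunday=6
month_ends = dates.dt.is_month_end.tolist()
leap_years = dates.dt.is_leap_year.tolist()
iso_week = dates.dt.isocalendar().week       # ISO week number of the year

print(years, weekdays, month_ends, leap_years)
print(iso_week.tolist())
```

    2019-12-30 falls in ISO week 1 of 2020, which shows that the ISO week can belong to a different year than dt.year reports.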

    DatetimeIndex

    Given a number of periods and a frequency, the date_range() function creates a sequence of dates. By default, the frequency is one day.

    import pandas as pd
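    # Daily frequency
    datelist = pd.date_range('2019/08/21', periods=5)
    print(datelist)
    # Monthly (month-end) frequency; pandas 2.2+ spells this alias 'ME'
    datelist = pd.date_range('2019/08/21', periods=5,freq='M')
    print(datelist)
    # Build a date sequence over a given interval
    # (pd.datetime was removed in pandas 2.0; pd.Timestamp works in its place)
    start = pd.Timestamp(2017, 11, 1)
    end = pd.Timestamp(2017, 11, 5)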
    dates = pd.date_range(start, end)
    print(dates)

    bdate_range() produces a range of business days. Unlike date_range(), it excludes Saturdays and Sundays.

    import pandas as pd
    datelist = pd.bdate_range('2011/11/03', periods=5)
    print(datelist)
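
    One way to check the difference between the two functions is to compare the weekday numbers they produce for the same start date. This is a small sketch; the variable names are illustrative.

```python
import pandas as pd

# 2011-11-03 is a Thursday, so five calendar days run into the weekend
all_days = pd.date_range('2011/11/03', periods=5)
work_days = pd.bdate_range('2011/11/03', periods=5)

# dayofweek: Monday=0 ... Sunday=6
print(all_days.dayofweek.tolist())   # includes 5 and 6 (Saturday, Sunday)
print(work_days.dayofweek.tolist())  # only values below 5
```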

     

    """
    datetimeindex
    """
    
    import pandas as pd
    # Daily frequency
    d = pd.date_range('2019-01-01', periods=7)
    print(d)
    """
    DatetimeIndex(['2019-01-01', '2019-01-02', '2019-01-03', '2019-01-04',
                   '2019-01-05', '2019-01-06', '2019-01-07'],
                  dtype='datetime64[ns]', freq='D')
    """
    print(d.dtype)
    # datetime64[ns]
    print(type(d))  # type of the object
    # <class 'pandas.core.indexes.datetimes.DatetimeIndex'>
    
    # Generate a run of dates; the default freq is 'D' (daily)
    d = pd.date_range('2019-10-01',periods=7)
    print(d)
    """
    DatetimeIndex(['2019-10-01', '2019-10-02', '2019-10-03', '2019-10-04',
                   '2019-10-05', '2019-10-06', '2019-10-07'],
                  dtype='datetime64[ns]', freq='D')
    """
    
    # Generate dates at monthly (month-end) frequency; pandas 2.2+ spells this alias 'ME'
    d2 = pd.date_range('2019-10-01',periods=5,freq='M')
    print(d2)
    """
    DatetimeIndex(['2019-10-31', '2019-11-30', '2019-12-31', '2020-01-31',
                   '2020-02-29'],
                  dtype='datetime64[ns]', freq='M')
    """
    
    # Generate the dates in the closed interval [start, end]
    d3 = pd.date_range('2019-10-1','2019-10-7')
    print(d3)
    """
    DatetimeIndex(['2019-10-01', '2019-10-02', '2019-10-03', '2019-10-04',
                   '2019-10-05', '2019-10-06', '2019-10-07'],
                  dtype='datetime64[ns]', freq='D')
    """
    # Generate a run of dates containing business days only
    d4 = pd.bdate_range('2019-10-1',periods=7)
    print(d4)
    """
    DatetimeIndex(['2019-10-01', '2019-10-02', '2019-10-03', '2019-10-04',
                   '2019-10-07', '2019-10-08', '2019-10-09'],
                  dtype='datetime64[ns]', freq='B')
    """

    DataFrame

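    A DataFrame is a table-like data type that can be thought of as a two-dimensional array. It is indexed along two dimensions (rows and columns), and both indexes can be changed. A DataFrame has the following characteristics:

    • Columns can hold different types

    • Size is mutable

    • Axes are labeled (rows and columns)

    • Arithmetic operations can be performed on rows and columns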
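
    The four characteristics above can be sketched in a few lines. The sample data and the column names Height and AgePlusOne are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical sample data with labeled rows (r1, r2) and columns (Name, Age)
df = pd.DataFrame({'Name': ['Alex', 'Bob'], 'Age': [10, 12]},
                  index=['r1', 'r2'])

# Columns can hold different types: text and integers here
print(df.dtypes)

# Size is mutable: a new column can be added after construction
df['Height'] = [1.4, 1.5]

# Arithmetic on a column
df['AgePlusOne'] = df['Age'] + 1

# Arithmetic across each row (axis=1) over the chosen numeric columns
row_totals = df[['Age', 'Height']].sum(axis=1)
print(row_totals)
```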

    import pandas as pd
    
    # Create an empty DataFrame
    df = pd.DataFrame()
    print(df)
    """
    Empty DataFrame   # empty
    Columns: []       # columns
    Index: []         # index
    """
    
    # Create a DataFrame from a list
    data = ['Tom', 'Jerry', 'Dog', 'Lily']
    df = pd.DataFrame(data)
    print(df)
    """
           0
    0    Tom
    1  Jerry
    2    Dog
    3   Lily
    """
    
    # Create a DataFrame from a two-dimensional list
    # columns=['Name', 'Age'] sets the column labels; without it they default to 0, 1, ...
    data = [['Alex', 10],
            ['Bob', 12],
            ['Clarke', 13]
            ]
    df = pd.DataFrame(data, columns=['Name', 'Age'])
    print(df)
    """
         Name  Age
    0    Alex   10
    1     Bob   12
    2  Clarke   13
    """
    
    
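    # Convert the Age column to float after construction; newer pandas versions
    # raise an error if dtype=float is passed here, because Name cannot be cast.
    data = [['Alex', 10], ['Bob', 12], ['Clarke', 13]]
    df = pd.DataFrame(data, columns=['Name', 'Age']).astype({'Age': float})
    print(df)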
    """
         Name   Age
    0    Alex  10.0
    1     Bob  12.0
    2  Clarke  13.0
    """
    # Create a DataFrame from a list of dictionaries
    data = [{'a': 1, 'b': 2}, 
            {'a': 5, 'b': 10, 'c': 20}]
    df = pd.DataFrame(data)
    print(df)
    """
       a   b     c
    0  1   2   NaN
    1  5  10  20.0
    """
    
    # Create a DataFrame from a dictionary
    data = {'Name': ['Tom', 'Jack', 'Steve', 'Ricky'], 
            'Age': [28, 34, 29, 42]}
    df = pd.DataFrame(data, index=['s1', 's2', 's3', 's4'])
    print(df)
    """
         Name  Age
    s1    Tom   28
    s2   Jack   34
    s3  Steve   29
    s4  Ricky   42
    """
    
    data = {'one': pd.Series([1, 2, 3], 
                             index=['a', 'b', 'c']),
            'two': pd.Series([1, 2, 3, 4], 
                             index=['a', 'b', 'c', 'd'])}
    df = pd.DataFrame(data)
    print(df)
    """
       one  two
    a  1.0    1
    b  2.0    2
    c  3.0    3
    d  NaN    4
    """

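    Operations on the core data structures

    Column access

    A single column of a DataFrame is a Series. By definition, a DataFrame is a labeled two-dimensional array in which each label serves as the name of a column.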

    import pandas as pd
    
    d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
         'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
    
    df = pd.DataFrame(d)
    print(df['one'])
    """
    a    1.0
    b    2.0
    c    3.0
    d    NaN
    Name: one, dtype: float64
    """
    print(df[['one', 'two']])
    """
       one  two
    a  1.0    1
    b  2.0    2
    c  3.0    3
    d  NaN    4
    """

    Adding columns

    Adding a column to a DataFrame is simple: create a new column label and assign data to it.

    import pandas as pd
    
    data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],
            'Age':[28,34,29,42]}
    df = pd.DataFrame(data, index=['s1','s2','s3','s4'])
    
    # Access the Name column
    print(df['Name'],type(df['Name']))
    """
    s1      Tom
    s2     Jack
    s3    Steve
    s4    Ricky
    Name: Name, dtype: object <class 'pandas.core.series.Series'>
    """
    
    # Add a score column
    df['score']=pd.Series([90, 80, 70, 60],
                          index=['s1','s2','s3','s4'])
    print(df)
    """
         Name  Age  score
    s1    Tom   28     90
    s2   Jack   34     80
    s3  Steve   29     70
    s4  Ricky   42     60
    """

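    A new column can also be computed from existing ones, and a Series assigned as a column is aligned on index labels, not on order. The sketch below uses hypothetical data; AgeNextYear is a made-up column name.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Tom', 'Jack', 'Steve'], 'Age': [28, 34, 29]},
                  index=['s1', 's2', 's3'])

# A column derived from an existing column
df['AgeNextYear'] = df['Age'] + 1

# The Series is matched to rows by label: s2 gets 90, s1 gets 80,
# and s3, which has no entry, is filled with NaN
df['score'] = pd.Series([90, 80], index=['s2', 's1'])
print(df)
```
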
    Deleting columns

    To delete a column, use the del statement or the pop method that pandas provides, as follows:

    import pandas as pd
    
    d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
         'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']),
         'three' : pd.Series([10, 20, 30], index=['a', 'b', 'c'])}
    df = pd.DataFrame(d)
    print("dataframe is:")
    print(df)
    """
    dataframe is:
       one  two  three
    a  1.0    1   10.0
    b  2.0    2   20.0
    c  3.0    3   30.0
    d  NaN    4    NaN
    """
    # Delete one column: one
    del(df['one'])
    print(df)
    """
       two  three
    a    1   10.0
    b    2   20.0
    c    3   30.0
    d    4    NaN
    """
    
    # Delete a column by calling the pop method
    df.pop('two')
    print(df)
    """
       three
    a   10.0
    b   20.0
    c   30.0
    d    NaN
    """

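    Both del and pop change the DataFrame in place, and pop also returns the removed column. When the original should stay untouched, drop(columns=...) returns a new DataFrame instead. A minimal sketch with hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3], 'two': [4, 5, 6], 'three': [7, 8, 9]})

# pop removes the column in place and returns it as a Series
popped = df.pop('two')
print(popped.tolist())

# drop returns a new DataFrame; df itself keeps the 'three' column
trimmed = df.drop(columns=['three'])
print(list(df.columns), list(trimmed.columns))
```
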
    Row access

    To access just a few rows of a DataFrame, use array-style slicing with ":":

    import pandas as pd
    
    d = {'one' : pd.Series([1, 2, 3],
                           index=['a', 'b', 'c']),
         'two' : pd.Series([1, 2, 3, 4],
                           index=['a', 'b', 'c', 'd'])}
    
    df = pd.DataFrame(d)
    print(df[2:4])
    """
       one  two
    c  3.0    3
    d  NaN    4
    """

    loc is the label-based selection method for a DataFrame: it selects by index name. It is used as follows:

    import pandas as pd
    
    d = {'one' : pd.Series([1, 2, 3],
                           index=['a', 'b', 'c']),
         'two' : pd.Series([1, 2, 3, 4],
                           index=['a', 'b', 'c', 'd'])}
    
    df = pd.DataFrame(d)
    # Access by index label
    print(df.loc['b'])
    """
    one    2.0
    two    2.0
    Name: b, dtype: float64
    """
    print(df.loc[['a', 'b']])
    """
       one  two
    a  1.0    1
    b  2.0    2
    """

    Unlike loc, iloc must be given the integer positions of the rows and columns. It is used as follows:

    import pandas as pd
    
    d = {'one' : pd.Series([1, 2, 3],
                           index=['a', 'b', 'c']),
         'two' : pd.Series([1, 2, 3, 4],
                           index=['a', 'b', 'c', 'd'])}
    
    df = pd.DataFrame(d)
    print(df)
    """
       one  two
    a  1.0    1
    b  2.0    2
    c  3.0    3
    d  NaN    4
    """
    # Access by integer position
    print(df.iloc[2])
    """
    one    3.0
    two    3.0
    Name: c, dtype: float64
    """
    print(df.iloc[[2, 3]])
    """
       one  two
    c  3.0    3
    d  NaN    4
    """

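    Both loc and iloc also accept a column selector as a second argument, and they differ in how slices treat the end point. The following sketch uses hypothetical data to put the two side by side.

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3, 4], 'two': [10, 20, 30, 40]},
                  index=['a', 'b', 'c', 'd'])

by_label = df.loc['b', 'two']   # row label 'b', column label 'two'
by_position = df.iloc[1, 1]     # second row, second column

label_slice = df.loc['a':'c']   # label slice includes 'c'
position_slice = df.iloc[0:2]   # positional slice excludes position 2

print(by_label, by_position)
print(len(label_slice), len(position_slice))
```
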
    Adding rows

    import pandas as pd
    
    df = pd.DataFrame([['zs', 12],
                       ['ls', 4]],
                      columns = ['Name','Age'])
    df2 = pd.DataFrame([['ww', 16],
                        ['zl', 8]],
                       columns = ['Name','Age'])
    
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    df = pd.concat([df, df2])
    print(df)
    """
      Name  Age
    0   zs   12
    1   ls    4
    0   ww   16
    1   zl    8
    """

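    In the output above, the row labels 0 and 1 each appear twice because every frame keeps its own index. If fresh consecutive labels are wanted, pd.concat accepts ignore_index=True. A minimal sketch; the name combined is illustrative.

```python
import pandas as pd

df = pd.DataFrame([['zs', 12], ['ls', 4]], columns=['Name', 'Age'])
df2 = pd.DataFrame([['ww', 16], ['zl', 8]], columns=['Name', 'Age'])

# ignore_index=True discards the old labels and numbers the rows 0..n-1
combined = pd.concat([df, df2], ignore_index=True)
print(combined)
```
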
    Deleting rows

    Use index labels to delete rows from a DataFrame. If a label is duplicated, every row carrying it is deleted.

    import pandas as pd
    
    df = pd.DataFrame([['zs', 12],
                       ['ls', 4]],
                      columns = ['Name','Age'])
    df2 = pd.DataFrame([['ww', 16],
                        ['zl', 8]],
                       columns = ['Name','Age'])
    df = pd.concat([df, df2])  # replaces df.append(df2), removed in pandas 2.0
    print(df)
    """
      Name  Age
    0   zs   12
    1   ls    4
    0   ww   16
    1   zl    8
    """
    # Delete the rows whose index label is 0
    df = df.drop(0)
    print(df)
    """
      Name  Age
    1   ls    4
    1   zl    8
    """

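    To remove only one of several rows that share a label, one option is to make the labels unique first with reset_index(drop=True). The sketch below rebuilds the same four-row frame; the variable names are illustrative.

```python
import pandas as pd

df = pd.concat([
    pd.DataFrame([['zs', 12], ['ls', 4]], columns=['Name', 'Age']),
    pd.DataFrame([['ww', 16], ['zl', 8]], columns=['Name', 'Age']),
])

# Label 0 occurs twice, so both rows are removed
both_gone = df.drop(0)

# After renumbering, label 0 refers to a single row
unique = df.reset_index(drop=True)
only_first_gone = unique.drop(0)

print(both_gone['Name'].tolist())
print(only_first_gone['Name'].tolist())
```
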
    Modifying data in a DataFrame

    To change data in a DataFrame, select the part to change and assign new values to it.

    import pandas as pd
    
    df = pd.DataFrame([['zs', 12],
                       ['ls', 4]],
                      columns = ['Name','Age'])
    df2 = pd.DataFrame([['ww', 16],
                        ['zl', 8]],
                       columns = ['Name','Age'])
    df = pd.concat([df, df2])  # replaces df.append(df2), removed in pandas 2.0
    print(df)
    """
      Name  Age
    0   zs   12
    1   ls    4
    0   ww   16
    1   zl    8
    """
    # Assign through .loc; chained indexing such as df['Name'][0] = ... is unreliable
    df.loc[0, 'Name'] = 'Tom'
    print(df)
    """
      Name  Age
    0  Tom   12
    1   ls    4
    0  Tom   16
    1   zl    8
    """

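    Selection and assignment can be combined in a single .loc or .iloc expression, which also allows updating every row that matches a condition. A minimal sketch with hypothetical data; the value 'senior' is made up for the example.

```python
import pandas as pd

df = pd.DataFrame([['zs', 12], ['ls', 4], ['ww', 16]],
                  columns=['Name', 'Age'])

# Change a single cell by row label and column label
df.loc[1, 'Age'] = 5

# Change the Name of every row whose Age exceeds 10
df.loc[df['Age'] > 10, 'Name'] = 'senior'

# Change a single cell by row position and column position
df.iloc[0, 1] = 13

print(df)
```
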
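    Common DataFrame attributes

    No.  Attribute or method  Description
    1    axes                 Returns a list of the row and column axis labels.
    2    dtypes               Returns the data type of each column (a single column, being a Series, has dtype).
    3    empty                Returns True if the DataFrame is empty.
    4    ndim                 Returns the number of dimensions of the underlying data; 2 for a DataFrame.
    5    size                 Returns the number of elements in the underlying data.
    6    values               Returns the data as an ndarray.
    7    head()               Returns the first n rows.
    8    tail()               Returns the last n rows.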

    Example code:

    import pandas as pd
    
    data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],
            'Age':[28,34,29,42]}
    df = pd.DataFrame(data, index=['s1','s2','s3','s4'])
    df['score']=pd.Series([90, 80, 70, 60],
                          index=['s1','s2','s3','s4'])
    # print(df)
    """
         Name  Age  score
    s1    Tom   28     90
    s2   Jack   34     80
    s3  Steve   29     70
    s4  Ricky   42     60
    """
    print(df.axes)
    #[Index(['s1', 's2', 's3', 's4'], dtype='object'), Index(['Name', 'Age', 'score'], dtype='object')]
    print(df['Age'].dtype)#int64
    print(df.empty)#False
    print(df.ndim)#2
    print(df.size)#12
    print(df.values)
    """
    [['Tom' 28 90]
     ['Jack' 34 80]
     ['Steve' 29 70]
     ['Ricky' 42 60]]
    """
    print(df.head(3)) # first three rows of df
    """
         Name  Age  score
    s1    Tom   28     90
    s2   Jack   34     80
    s3  Steve   29     70
    """
    print(df.tail(3)) # last three rows of df
    """
         Name  Age  score
    s2   Jack   34     80
    s3  Steve   29     70
    s4  Ricky   42     60
    """

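    The example above reads dtype from a single column. For the whole DataFrame, the per-column types are available through dtypes, and the row and column counts through shape. Neither appears in the example code, so treat this as a supplementary sketch with hypothetical data.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Tom', 'Jack'],
                   'Age': [28, 34],
                   'score': [90.5, 80.0]})

# One dtype per column
print(df.dtypes)

# shape is (number of rows, number of columns); size is their product
rows, cols = df.shape
print(rows, cols, df.size)
```
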
     

  • Original article: https://www.cnblogs.com/maplethefox/p/11495839.html