  • pandas

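      pandas is a powerful Python toolkit for data analysis, built on top of NumPy.

      Main features:

    1. Data structures with automatic alignment: DataFrame and Series
    2. Integrated time series functionality
    3. A rich set of mathematical operations
    4. Flexible handling of missing data

      Install: pip install pandas

      Import: import pandas as pd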

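    Series: a one-dimensional data object

      A Series is an array-like one-dimensional object made up of a set of values and an associated set of data labels (the index).

      Creation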

    In [206]: import pandas as pd
    
    In [207]: pd.Series([4,7,-5,3])
    Out[207]: 
    0    4
    1    7
    2   -5
    3    3
    dtype: int64
    
    In [208]: pd.Series([4,7,-5,3], index=['a','b','c','d'])
    Out[208]: 
    a    4
    b    7
    c   -5
    d    3
    dtype: int64
    
    In [209]: pd.Series({'a':1,'b':2})
    Out[209]: 
    a    1
    b    2
    dtype: int64
    
    In [210]: pd.Series(0, index=['a','b','c','d'])
    Out[210]: 
    a    0
    b    0
    c    0
    d    0
    dtype: int64
    

       Getting the array of values and the index: the values and index attributes

    In [211]: a = pd.Series([4,7,-5,3], index=['a','b','c','d'])
    
    In [212]: a.values
    Out[212]: array([ 4,  7, -5,  3], dtype=int64)
    
    In [214]: a.index
    Out[214]: Index(['a', 'b', 'c', 'd'], dtype='object')
    

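       A Series behaves like a cross between a list (array) and a dict.

    Series: usage

      Series supports array-like behavior

    • Arithmetic with a scalar: sr * 2 (sr holds the same data as the Series a created above)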
    In [217]: sr
    Out[217]: 
    a    4
    b    7
    c   -5
    d    3
    dtype: int64
    
    In [218]: sr * 2
    Out[218]: 
    a     8
    b    14
    c   -10
    d     6
    dtype: int64
    
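    •  Arithmetic between two Series: sr1 + sr2. Values are added only where the labels match; a label present in just one Series still appears in the result, with the value NaN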
    In [221]: sr2 = pd.Series([1,2,3,4],index=['a','b','c','d'])
    
    In [222]: sr + sr2
    Out[222]: 
    a    5
    b    9
    c   -2
    d    7
    dtype: int64
    
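    •  Positional indexing: sr[0], sr[[0,2,3]] (newer pandas versions deprecate integer positions in sr[...] on a labeled index; sr.iloc[0] is the explicit form)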
    In [224]: sr
    Out[224]: 
    a    4
    b    7
    c   -5
    d    3
    dtype: int64
    
    In [225]: sr[0]
    Out[225]: 4
    
    In [226]: sr[[0,2,3]]
    Out[226]: 
    a    4
    c   -5
    d    3
    dtype: int64
    
    •  Slicing: sr[:2]
    In [227]: sr[:2]
    Out[227]: 
    a    4
    b    7
    dtype: int64
    
    •  Universal functions: np.abs(sr) (requires import numpy as np)
    In [228]: sr
    Out[228]: 
    a    4
    b    7
    c   -5
    d    3
    dtype: int64
    
    In [229]: np.abs(sr)
    Out[229]: 
    a    4
    b    7
    c    5
    d    3
    dtype: int64
    
    •  Boolean filtering
    In [230]: sr
    Out[230]: 
    a    4
    b    7
    c   -5
    d    3
    dtype: int64
    
    In [231]: sr[sr>0]
    Out[231]: 
    a    4
    b    7
    d    3
    dtype: int64
    

      Series supports dict-like behavior (labels)

    • Creating a Series from a dict: Series(dic)
    In [232]: pd.Series({'a':1,'b':5})
    Out[232]: 
    a    1
    b    5
    dtype: int64
    
    •  Membership test on labels with in: 'a' in sr. Note that iterating over a Series yields its values, not its labels
    In [233]: sr
    Out[233]: 
    a    4
    b    7
    c   -5
    d    3
    dtype: int64
    
    In [234]: 'a' in sr
    Out[234]: True
    
    In [236]: for i in sr:
         ...:     print(i)
    4
    7
    -5
    3
    
    •  Label indexing: sr['a'], sr[['a','b','d']]
    In [237]: sr
    Out[237]: 
    a    4
    b    7
    c   -5
    d    3
    dtype: int64
    
    In [238]: sr['a']
    Out[238]: 4
    
    In [239]: sr[['a','c']]
    Out[239]: 
    a    4
    c   -5
    dtype: int64
    
    In [240]: sr['a':'c']  # label slices include both the start and the end
    Out[240]: 
    a    4
    b    7
    c   -5
    dtype: int64
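
      Because iterating over a Series yields values rather than labels (unlike a dict), it can help to see the alternatives side by side. The following is a minimal sketch that reuses the same data as the sessions above; the variable names values, labels and pairs are illustrative only.

```python
import pandas as pd

sr = pd.Series([4, 7, -5, 3], index=['a', 'b', 'c', 'd'])

# Iterating over the Series itself yields the values, unlike a dict
values = list(sr)

# Iterate over the labels explicitly through the index
labels = list(sr.index)

# items() yields (label, value) pairs, similar to dict.items()
pairs = list(sr.items())

print(values)
print(labels)
print(pairs)
```

      When both the label and the value are needed inside a loop, items() is usually the most direct choice.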
    

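    Series: integer indexes

      When the labels of a Series are themselves integers, label lookup and positional lookup are easy to confuse. With sr[...], an integer is interpreted as a label by default.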

    In [40]: sr = pd.Series(np.arange(10))
    
    In [41]: sr
    Out[41]:
    0    0
    1    1
    2    2
    3    3
    4    4
    5    5
    6    6
    7    7
    8    8
    9    9
    dtype: int32
    
    
    In [44]: sr2 = sr[5:].copy()
    
    In [45]: sr2
    Out[45]:
    5    5
    6    6
    7    7
    8    8
    9    9
    dtype: int32
    
    In [46]: sr2[5]  # looks up the label 5; as a position, 5 would be out of range and raise an error
    Out[46]: 5
    
    In [47]: sr2[-1]
    ---------------------------------------------------------------------------
    KeyError                                  Traceback (most recent call last)
    <ipython-input-47-3882bebf0859> in <module>()
    ----> 1 sr2[-1]
    
    C:python serveranacondalibsite-packagespandascoreseries.py in __getitem__(self, key)
        621         key = com._apply_if_callable(key, self)
        622         try:
    --> 623             result = self.index.get_value(self, key)
        624
        625             if not is_scalar(result):
    
    C:python serveranacondalibsite-packagespandascoreindexesase.py in get_value(self, series, key)
       2558         try:
       2559             return self._engine.get_value(s, k,
    -> 2560                                           tz=getattr(series.dtype, 'tz', None))
       2561         except KeyError as e1:
       2562             if len(self) > 0 and self.inferred_type in ['integer', 'boolean']:
    
    pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()
    
    pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()
    
    pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
    
    pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()
    
    pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()
    
    KeyError: -1
    

       Solution: state the lookup method explicitly. loc selects by label, iloc selects by position.

    In [48]: sr2.loc[5]   # loc: lookup by label (key)
    Out[48]: 5
    
    In [50]: sr2.iloc[4]  # iloc: lookup by position (offset)
    Out[50]: 9
    
    In [51]: sr2.iloc[-1]
    Out[51]: 9
    

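    Series: data alignment

      When pandas operates on two Series objects, it aligns them by index before computing.

      If the two indexes are not identical, the index of the result is the union of the operands' indexes.

      If only one of the objects has a value for a given label, the result for that label is NaN (missing).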

    In [6]: sr1 = pd.Series([4,9,100],index=['a','b','c'])
    
    In [7]: sr1
    Out[7]:
    a      4
    b      9
    c    100
    dtype: int64
    
    In [8]: sr2 = pd.Series([4,5,6],index=['b','c','d'])
    
    In [9]: sr2
    Out[9]:
    b    4
    c    5
    d    6
    dtype: int64
    
    In [10]: sr1 + sr2
    Out[10]:
    a      NaN
    b     13.0
    c    105.0
    d      NaN
    dtype: float64
    

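       How should these missing values be handled? For example, above we might want a to come out as 4, treating the label missing from the other Series as 0.

      The flexible arithmetic methods can do this: add, sub, div, mul

      sr1 + sr2 is equivalent to sr1.add(sr2); the fill_value parameter of these methods substitutes a value for missing entries.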

    In [10]: sr1 + sr2
    Out[10]:
    a      NaN
    b     13.0
    c    105.0
    d      NaN
    dtype: float64
    
    In [11]: sr1.add(sr2)
    Out[11]:
    a      NaN
    b     13.0
    c    105.0
    d      NaN
    dtype: float64
    
    In [12]: sr1.add(sr2,fill_value=0)
    Out[12]:
    a      4.0
    b     13.0
    c    105.0
    d      6.0
    dtype: float64
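
      The session above only exercises add. As a small sketch under the same data, sub and mul accept fill_value in the same way; choosing 1 as the fill for multiplication is an illustrative assumption (it leaves the existing value unchanged), not something the original text prescribes.

```python
import pandas as pd

sr1 = pd.Series([4, 9, 100], index=['a', 'b', 'c'])
sr2 = pd.Series([4, 5, 6], index=['b', 'c', 'd'])

# Subtraction: a label missing on one side is treated as 0
diff = sr1.sub(sr2, fill_value=0)

# Multiplication: a label missing on one side is treated as 1
prod = sr1.mul(sr2, fill_value=1)

print(diff)
print(prod)
```

      The fill value is applied only where exactly one operand lacks the label; a label missing from both operands would still produce NaN.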
    

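    Series: missing data

      Missing data is represented by NaN (Not a Number), whose value equals np.nan; the built-in None is also treated as NaN.

      The following methods help with missing values:

    1. dropna() drops entries whose value is NaN
    2. fillna() fills in missing data
    3. isnull() returns a boolean array, True where a value is missing
    4. notnull() returns a boolean array, False where a value is missing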
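
      To illustrate the remark that None is treated as NaN, here is a minimal sketch with made-up numeric data. It only shows the conversion and the two boolean tests from the list above.

```python
import numpy as np
import pandas as pd

# None in numeric data is converted to NaN, so the dtype becomes float64
sr = pd.Series([1, None, 3])

mask = sr.isnull()             # True where the value is missing
n_present = sr.notnull().sum() # number of non-missing values

print(sr.dtype)
print(mask.tolist())
print(n_present)
```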

      First approach: drop missing values

    In [61]: sr = sr1 + sr2  # sr1 and sr2 here are not the Series from the alignment example above
    
    In [62]: sr
    Out[62]:
    a    33.0
    b     NaN
    c    32.0
    d    45.0
    dtype: float64
    
    In [63]: sr.isnull()  # test for NaN values
    Out[63]:
    a    False
    b     True
    c    False
    d    False
    dtype: bool
    
    In [64]: sr.notnull()  # True where not NaN; combined with boolean filtering this drops missing values
    Out[64]:
    a     True
    b    False
    c     True
    d     True
    dtype: bool
    
    In [65]: sr[sr.notnull()]  # filter
    Out[65]:
    a    33.0
    c    32.0
    d    45.0
    dtype: float64
    
    In [66]: sr.dropna()  # Series also has a built-in method for dropping NaN values
    Out[66]:
    a    33.0
    c    32.0
    d    45.0
    dtype: float64
    

       Second approach: fill missing values

    In [68]: sr.fillna(0)  # fill with 0; the mean sr.mean() is another option, since mean() skips NaN
    Out[68]:
    a    33.0
    b     0.0
    c    32.0
    d    45.0
    dtype: float64
    
    In [69]: sr = sr.fillna(0)  # fillna returns a new object rather than modifying in place, so reassign
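
      Filling with the mean is mentioned above but not shown. The sketch below rebuilds the same values as the session and fills the gap with sr.mean(); the names mean_value and filled are illustrative.

```python
import numpy as np
import pandas as pd

sr = pd.Series([33.0, np.nan, 32.0, 45.0], index=['a', 'b', 'c', 'd'])

# mean() skips NaN by default, so this is (33 + 32 + 45) / 3
mean_value = sr.mean()

# fillna returns a new Series; the original keeps its NaN
filled = sr.fillna(mean_value)

print(mean_value)
print(filled)
```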
    

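    DataFrame: a two-dimensional data object

      A DataFrame is a tabular data structure containing an ordered set of columns. It can be viewed as a dict of Series that all share one index.

      Creation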

    In [3]: pd.DataFrame({'one':[1,2,3,4], 'two':[4,3,2,1]})
    Out[3]: 
       one  two
    0    1    4
    1    2    3
    2    3    2
    3    4    1
    
    In [4]: pd.DataFrame({'one': pd.Series([1,2,3], index=['a','b','c']), 'two': pd.Series([1,2,3,4],index=['b','a','c','d'])})
    Out[4]: 
       one  two
    a  1.0    2
    b  2.0    1
    c  3.0    3
    d  NaN    4  # labels missing from one Series are filled with NaN
    
    In [5]: pd.DataFrame({'one':[1,2,3,4], 'two':[4,3,2,1]}, index=['a','b','c','d'])
    Out[5]: 
       one  two
    a    1    4
    b    2    3
    c    3    2
    d    4    1
    

    DataFrame: common attributes

    • index: the row index (row labels)
    • values: the array of values
    • columns: the column index (column names)
    In [5]: pd.DataFrame({'one':[1,2,3,4], 'two':[4,3,2,1]}, index=['a','b','c','d'])
    Out[5]: 
       one  two
    a    1    4
    b    2    3
    c    3    2
    d    4    1
    
    In [6]: df = _5
    
    In [7]: df.index
    Out[7]: Index(['a', 'b', 'c', 'd'], dtype='object')
    
    In [8]: df.values
    Out[8]: 
    array([[1, 4],
           [2, 3],
           [3, 2],
           [4, 1]], dtype=int64)
    
    In [9]: df.columns
    Out[9]: Index(['one', 'two'], dtype='object')
    
    •  T: transpose, swapping rows and columns
    In [10]: df
    Out[10]: 
       one  two
    a    1    4
    b    2    3
    c    3    2
    d    4    1
    
    In [11]: df.T
    Out[11]: 
         a  b  c  d
    one  1  2  3  4
    two  4  3  2  1
    
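    •  describe(): quick summary statistics for each column: count, mean, standard deviation, minimum, quartiles (including the median), and maximum
    In [13]: df.describe()  # statistics are computed per column
    Out[13]:
                one       two
    count  4.000000  4.000000  # number of values (NaN excluded)
    mean   2.500000  2.500000  # mean
    std    1.290994  1.290994  # standard deviation
    min    1.000000  1.000000  # minimum
    25%    1.750000  1.750000  
    50%    2.500000  2.500000  # median
    75%    3.250000  3.250000
    max    4.000000  4.000000  # maximum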
    

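    DataFrame: indexing and slicing

      A DataFrame is two-dimensional, so it has both a row index and a column index.

      As with Series, a DataFrame can be indexed and sliced either by label or by position.

      Bracket indexing

      Brackets select the column first and then the row. Selecting only a column is supported, but selecting only a row by its label is not.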

    In [18]: df
    Out[18]: 
       one  two
    a    1    4
    b    2    3
    c    3    2
    d    4    1
    
    In [19]: df['one']['a']  # bracket lookup: column first, then row
    Out[19]: 1
    
    In [20]: df['one']  # select a single column
    Out[20]: 
    a    1
    b    2
    c    3
    d    4
    Name: one, dtype: int64
    
    In [21]: df['a']  # raises KeyError: brackets look up column labels, so a row cannot be selected this way
    

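      Row/column indexing

    • loc: selection by label (row and column names)
    • iloc: selection by integer position

      Usage: separate the two parts with a comma, row selector first and column selector second.

      Each part can be a single label or position, a slice, a boolean mask, or a list (fancy indexing), in any combination.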

    In [22]: df
    Out[22]: 
       one  two
    a    1    4
    b    2    3
    c    3    2
    d    4    1
    
    In [23]: df.loc['a',]  # loc can select just a row
    Out[23]: 
    one    1
    two    4
    Name: a, dtype: int64
    
    In [24]: df.loc[['a','c'],]  # fancy indexing is supported
    Out[24]: 
       one  two
    a    1    4
    c    3    2
    
    In [25]: df.loc['a':'c','one']
    Out[25]: 
    a    1
    b    2
    c    3
    Name: one, dtype: int64
    
    In [26]: df.iloc[0]  # iloc can also select just a row
    Out[26]: 
    one    1
    two    4
    Name: a, dtype: int64
    
    In [27]: df.iloc[0][1]
    Out[27]: 4
    
    In [28]: df.iloc[0,1]
    Out[28]: 4
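
      The sessions above cover single labels, slices and lists, but not the boolean case. As a hedged sketch on the same frame, a boolean mask can serve as the row selector of loc, and a slice can be combined with a list of positions in iloc; the names big and part are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3, 4], 'two': [4, 3, 2, 1]},
                  index=['a', 'b', 'c', 'd'])

# Boolean mask for the rows, list of labels for the columns
big = df.loc[df['one'] > 2, ['two']]

# Slice of positions for the rows, list of positions for the columns
part = df.iloc[1:3, [0, 1]]

print(big)
print(part)
```

      Unlike label slices with loc, positional slices with iloc exclude the end position, so 1:3 selects rows b and c only.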
    

    DataFrame: data alignment and missing data

      DataFrame objects are also aligned during arithmetic, with the row index and the column index aligned separately.

    In [29]: pd.DataFrame({'one':[1,2,3,4],'two':[4,5,6,7]}, index=['a','b','c','d'])
    Out[29]: 
       one  two
    a    1    4
    b    2    5
    c    3    6
    d    4    7
    
    In [30]: df = _29
    
    In [31]: df2 = pd.DataFrame({'two':[7,8,7,8],'one':[8,9,8,8]}, index=['a','c','d','b'])
    
    In [32]: df2
    Out[32]: 
       two  one
    a    7    8
    c    8    9
    d    7    8
    b    8    8
    
    In [33]: df + df2
    Out[33]: 
       one  two
    a    9   11
    b   10   13
    c   12   14
    d   12   14
    

      Handling missing values, approach 1: fill with fillna()

    In [35]: df.loc['e', 'one'] = np.nan
    
    In [36]: df.loc['e', 'two'] = 10
    
    In [37]: df.loc['f', 'one'] = np.nan
    
    In [38]: df.loc['f', 'two'] = np.nan
    
    In [39]: df
    Out[39]: 
       one   two
    a  1.0   4.0
    b  2.0   5.0
    c  3.0   6.0
    d  4.0   7.0
    e  NaN  10.0
    f  NaN   NaN
    
    In [40]: df.fillna(0)
    Out[40]: 
       one   two
    a  1.0   4.0
    b  2.0   5.0
    c  3.0   6.0
    d  4.0   7.0
    e  0.0  10.0
    f  0.0   0.0
    

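      Handling missing values, approach 2: drop

    • dropna(): axis chooses whether rows or columns are dropped (default 0 drops rows, 1 drops columns); how chooses the condition: 'any' drops a row or column containing any NaN, 'all' drops it only when every value is NaN
    In [39]: df2.dropna()  # the default is how='any'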
    Out[39]:
       one  two
    a  8.0  7.0
    c  9.0  8.0
    d  8.0  7.0
    b  8.0  8.0
    
    In [40]: df2.dropna(how='all')  # drop rows in which every column is NaN
    Out[40]:
       one   two
    a  8.0   7.0
    c  9.0   8.0
    d  8.0   7.0
    b  8.0   8.0
    e  NaN  10.0
    
    In [41]: df2.dropna(how='any')  # drop rows containing any NaN
    Out[41]:
       one  two
    a  8.0  7.0
    c  9.0  8.0
    d  8.0  7.0
    b  8.0  8.0
    
    In [42]: df.loc['a','one'] = np.nan
    
    In [43]: df
    Out[43]:
       one  two
    a  NaN    4
    b  2.0    5
    c  3.0    6
    d  4.0    7
    
    In [44]: df.dropna(axis=1)  # drop columns containing NaN
    Out[44]:
       two
    a    4
    b    5
    c    6
    d    7
    
    • isnull()
    • notnull()
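
      isnull() and notnull() are listed for DataFrame without a session. The following sketch, on a small made-up frame, shows one plausible use: counting missing values per column and keeping only the fully populated rows. The names missing, per_column and complete_rows are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [1.0, np.nan, np.nan],
                   'two': [4.0, 10.0, np.nan]},
                  index=['a', 'e', 'f'])

# Element-wise test: a DataFrame of booleans with the same shape
missing = df.isnull()

# True counts as 1, so summing gives the number of NaN per column
per_column = missing.sum()

# Keep rows in which every column is present (same effect as dropna())
complete_rows = df[df.notnull().all(axis=1)]

print(per_column)
print(complete_rows)
```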

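    pandas: common methods

    • mean(axis=0, skipna=True): mean of each column (axis=0, the default) or of each row (axis=1)
    • sum(axis=0): sum of each column (axis=0, the default) or of each row (axis=1)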
    In [45]: df
    Out[45]:
       one  two
    a  NaN    4
    b  2.0    5
    c  3.0    6
    d  4.0    7
    
    In [46]: df.mean()  # column means by default
    Out[46]:
    one    3.0
    two    5.5
    dtype: float64
    
    In [47]: df.mean(axis=1)  # row means
    Out[47]:
    a    4.0
    b    3.5
    c    4.5
    d    5.5
    dtype: float64
    
    In [48]: df.sum()  # column sums
    Out[48]:
    one     9.0
    two    22.0
    dtype: float64
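
      Two options from the signatures above are not exercised in the session: sum along axis=1 and skipna=False. A minimal sketch on the same frame follows; the names row_sums and strict_mean are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [np.nan, 2, 3, 4], 'two': [4, 5, 6, 7]},
                  index=['a', 'b', 'c', 'd'])

# axis=1 sums across the columns of each row; NaN is skipped by default
row_sums = df.sum(axis=1)

# skipna=False lets NaN propagate, so column one has no defined mean
strict_mean = df.mean(skipna=False)

print(row_sums)
print(strict_mean)
```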
    
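    • sort_index(axis=0, ascending=True): sort by the row index (axis=0) or the column index (axis=1); ascending=True sorts in ascending order, False in descending order
    • sort_values(by, axis=0, ascending=True): sort by values; by names the column (axis=0) or row (axis=1) to sort on
    In [49]: df.sort_values(by='two')  # sort by a column, ascending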
    Out[49]:
       one  two
    a  NaN    4
    b  2.0    5
    c  3.0    6
    d  4.0    7
    
    In [50]: df.sort_values(by='two',ascending=False)  # sort by a column, descending
    Out[50]:
       one  two
    d  4.0    7
    c  3.0    6
    b  2.0    5
    a  NaN    4
    
    In [52]: df.sort_values(by='a',ascending=False,axis=1)  # sort the columns by the values in row a, descending
    Out[52]:
       two  one
    a    4  NaN
    b    5  2.0
    c    6  3.0
    d    7  4.0
    
    In [53]: df.sort_values(by='one')  # NaN values are not ordered and are placed last
    Out[53]:
       one  two
    b  2.0    5
    c  3.0    6
    d  4.0    7
    a  NaN    4
    
    In [54]: df.sort_values(by='one',ascending=False)
    Out[54]:
       one  two
    d  4.0    7
    c  3.0    6
    b  2.0    5
    a  NaN    4
    
    In [55]: df.sort_index()  # sort by row index, ascending
    Out[55]:
       one  two
    a  NaN    4
    b  2.0    5
    c  3.0    6
    d  4.0    7
    
    In [56]: df.sort_index(ascending=False)  # sort by row index, descending
    Out[56]:
       one  two
    d  4.0    7
    c  3.0    6
    b  2.0    5
    a  NaN    4
    
    In [57]: df.sort_index(axis=1)  # sort by column index
    Out[57]:
       one  two
    a  NaN    4
    b  2.0    5
    c  3.0    6
    d  4.0    7
    
    In [58]: df.sort_index(ascending=False,axis=1)
    Out[58]:
       two  one
    a    4  NaN
    b    5  2.0
    c    6  3.0
    d    7  4.0
    

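      Other

    • apply(func, axis=0): applies a function to each column (axis=0) or each row (axis=1); func may return a scalar or a Series
    • applymap(func): applies a function to every element of a DataFrame (pandas 2.1 and later offer DataFrame.map for this and deprecate applymap)
    • map(func): applies a function to every element of a Series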
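
      None of the three methods above has a session, so here is a minimal sketch on a small frame. The lambdas are made up for illustration, and the hasattr check is an assumption added so that the element-wise call runs on both older releases (applymap) and newer ones (DataFrame.map).

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3, 4], 'two': [4, 5, 6, 7]},
                  index=list('abcd'))

# apply: func receives each column (axis=0) or each row (axis=1) as a Series
col_range = df.apply(lambda col: col.max() - col.min())
row_total = df.apply(lambda row: row['one'] + row['two'], axis=1)

# Element-wise on a DataFrame: DataFrame.map where available, else applymap
elementwise = df.map if hasattr(df, 'map') else df.applymap
squared = elementwise(lambda x: x ** 2)

# Series.map: element-wise on a single column
parity = df['one'].map(lambda x: 'even' if x % 2 == 0 else 'odd')

print(col_range)
print(row_total)
print(squared)
print(parity)
```

      For simple arithmetic such as squaring, the vectorized form df ** 2 is normally preferable; apply and map are most useful when the function cannot be expressed with built-in operations.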

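    pandas: working with time objects

      Generating an array of time objects: date_range

    • start: start time
    • end: end time
    • periods: number of periods to generate
    • freq: frequency, 'D' (day) by default; other options include H (hour), W (week), B (business day), SM (semi-month), M (month end), T (minute), S (second), A (year end); recent pandas releases rename several of these aliases (for example h, min, s, ME, YE)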
    In [71]: pd.date_range('2018-01-01',periods=10)
    Out[71]: 
    DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
                   '2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08',
                   '2018-01-09', '2018-01-10'],
                  dtype='datetime64[ns]', freq='D')
    
    In [72]: pd.date_range('2018-01-01','2030-01-01',freq='A')
    Out[72]: 
    DatetimeIndex(['2018-12-31', '2019-12-31', '2020-12-31', '2021-12-31',
                   '2022-12-31', '2023-12-31', '2024-12-31', '2025-12-31',
                   '2026-12-31', '2027-12-31', '2028-12-31', '2029-12-31'],
                  dtype='datetime64[ns]', freq='A-DEC')
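
      As a further sketch of the freq parameter, the example below uses two aliases from the list that the sessions do not show, 'B' and 'W'. The date range is chosen only for illustration; 2018-01-01 was a Monday.

```python
import pandas as pd

# Business days only: the weekend of 2018-01-06 and 2018-01-07 is skipped
bdays = pd.date_range('2018-01-01', '2018-01-10', freq='B')

# Weekly frequency is anchored on Sundays by default (W-SUN)
weeks = pd.date_range('2018-01-01', periods=3, freq='W')

print(bdays)
print(weeks)
```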
    

       A time series is a Series or DataFrame indexed by time objects.

      When datetime objects are used as an index, they are stored in a DatetimeIndex.

    In [73]: sr = pd.Series(np.arange(20), index=pd.date_range('2018-01-01', periods=20))
    
    In [74]: sr
    Out[74]: 
    2018-01-01     0
    2018-01-02     1
    2018-01-03     2
    2018-01-04     3
    2018-01-05     4
    2018-01-06     5
    2018-01-07     6
    2018-01-08     7
    2018-01-09     8
    2018-01-10     9
    2018-01-11    10
    2018-01-12    11
    2018-01-13    12
    2018-01-14    13
    2018-01-15    14
    2018-01-16    15
    2018-01-17    16
    2018-01-18    17
    2018-01-19    18
    2018-01-20    19
    Freq: D, dtype: int32
    
    In [75]: sr.index
    Out[75]: 
    DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
                   '2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08',
                   '2018-01-09', '2018-01-10', '2018-01-11', '2018-01-12',
                   '2018-01-13', '2018-01-14', '2018-01-15', '2018-01-16',
                   '2018-01-17', '2018-01-18', '2018-01-19', '2018-01-20'],
                  dtype='datetime64[ns]', freq='D')
    

       Special time series features:

    • Selecting with a 'year' or 'year-month' string
    In [32]: sr = pd.Series(np.arange(1000),index=pd.date_range('2018-01-01',periods=1000))
    
    In [33]: sr['2018-03']  # select one month of one year
    Out[33]:
    2018-03-01    59
    2018-03-02    60
    2018-03-03    61
    2018-03-04    62
    2018-03-05    63
    2018-03-06    64
    2018-03-07    65
    2018-03-08    66
    2018-03-09    67
    2018-03-10    68
    2018-03-11    69
    2018-03-12    70
    2018-03-13    71
    2018-03-14    72
    2018-03-15    73
    2018-03-16    74
    2018-03-17    75
    2018-03-18    76
    2018-03-19    77
    2018-03-20    78
    2018-03-21    79
    2018-03-22    80
    2018-03-23    81
    2018-03-24    82
    2018-03-25    83
    2018-03-26    84
    2018-03-27    85
    2018-03-28    86
    2018-03-29    87
    2018-03-30    88
    2018-03-31    89
    Freq: D, dtype: int32
    
    In [35]: sr['2019']  # select a whole year
    Out[35]:
    2019-01-01    365
    2019-01-02    366
    2019-01-03    367
    2019-01-04    368
    2019-01-05    369
    2019-01-06    370
    2019-01-07    371
    2019-01-08    372
    2019-01-09    373
    2019-01-10    374
    2019-01-11    375
    2019-01-12    376
    2019-01-13    377
    2019-01-14    378
    2019-01-15    379
    2019-01-16    380
    2019-01-17    381
    2019-01-18    382
    2019-01-19    383
    2019-01-20    384
    2019-01-21    385
    2019-01-22    386
    2019-01-23    387
    2019-01-24    388
    2019-01-25    389
    2019-01-26    390
    2019-01-27    391
    2019-01-28    392
    2019-01-29    393
    2019-01-30    394
                 ...
    2019-12-02    700
    2019-12-03    701
    2019-12-04    702
    2019-12-05    703
    2019-12-06    704
    2019-12-07    705
    2019-12-08    706
    2019-12-09    707
    2019-12-10    708
    2019-12-11    709
    2019-12-12    710
    2019-12-13    711
    2019-12-14    712
    2019-12-15    713
    2019-12-16    714
    2019-12-17    715
    2019-12-18    716
    2019-12-19    717
    2019-12-20    718
    2019-12-21    719
    2019-12-22    720
    2019-12-23    721
    2019-12-24    722
    2019-12-25    723
    2019-12-26    724
    2019-12-27    725
    2019-12-28    726
    2019-12-29    727
    2019-12-30    728
    2019-12-31    729
    Freq: D, Length: 365, dtype: int32
    
    • Slicing with a date range
    In [36]: sr['2018-11':'2019-01']  # slice by year-month
    Out[36]:
    2018-11-01    304
    2018-11-02    305
    2018-11-03    306
    2018-11-04    307
    2018-11-05    308
    2018-11-06    309
    2018-11-07    310
    2018-11-08    311
    2018-11-09    312
    2018-11-10    313
    2018-11-11    314
    2018-11-12    315
    2018-11-13    316
    2018-11-14    317
    2018-11-15    318
    2018-11-16    319
    2018-11-17    320
    2018-11-18    321
    2018-11-19    322
    2018-11-20    323
    2018-11-21    324
    2018-11-22    325
    2018-11-23    326
    2018-11-24    327
    2018-11-25    328
    2018-11-26    329
    2018-11-27    330
    2018-11-28    331
    2018-11-29    332
    2018-11-30    333
                 ...
    2019-01-02    366
    2019-01-03    367
    2019-01-04    368
    2019-01-05    369
    2019-01-06    370
    2019-01-07    371
    2019-01-08    372
    2019-01-09    373
    2019-01-10    374
    2019-01-11    375
    2019-01-12    376
    2019-01-13    377
    2019-01-14    378
    2019-01-15    379
    2019-01-16    380
    2019-01-17    381
    2019-01-18    382
    2019-01-19    383
    2019-01-20    384
    2019-01-21    385
    2019-01-22    386
    2019-01-23    387
    2019-01-24    388
    2019-01-25    389
    2019-01-26    390
    2019-01-27    391
    2019-01-28    392
    2019-01-29    393
    2019-01-30    394
    2019-01-31    395
    Freq: D, Length: 92, dtype: int32
    
    In [37]: sr['2018-12-03':'2019-01-01']  # slice by date
    Out[37]:
    2018-12-03    336
    2018-12-04    337
    2018-12-05    338
    2018-12-06    339
    2018-12-07    340
    2018-12-08    341
    2018-12-09    342
    2018-12-10    343
    2018-12-11    344
    2018-12-12    345
    2018-12-13    346
    2018-12-14    347
    2018-12-15    348
    2018-12-16    349
    2018-12-17    350
    2018-12-18    351
    2018-12-19    352
    2018-12-20    353
    2018-12-21    354
    2018-12-22    355
    2018-12-23    356
    2018-12-24    357
    2018-12-25    358
    2018-12-26    359
    2018-12-27    360
    2018-12-28    361
    2018-12-29    362
    2018-12-30    363
    2018-12-31    364
    2019-01-01    365
    Freq: D, dtype: int32
    
    • Rich method support: resample(), strftime()
    In [38]: sr.resample('W').sum()  # weekly sums
    Out[38]:
    2018-01-07      21
    2018-01-14      70
    2018-01-21     119
    2018-01-28     168
    2018-02-04     217
    2018-02-11     266
    2018-02-18     315
    2018-02-25     364
    2018-03-04     413
    2018-03-11     462
    2018-03-18     511
    2018-03-25     560
    2018-04-01     609
    2018-04-08     658
    2018-04-15     707
    2018-04-22     756
    2018-04-29     805
    2018-05-06     854
    2018-05-13     903
    2018-05-20     952
    2018-05-27    1001
    2018-06-03    1050
    2018-06-10    1099
    2018-06-17    1148
    2018-06-24    1197
    2018-07-01    1246
    2018-07-08    1295
    2018-07-15    1344
    2018-07-22    1393
    2018-07-29    1442
                  ...
    2020-03-08    5558
    2020-03-15    5607
    2020-03-22    5656
    2020-03-29    5705
    2020-04-05    5754
    2020-04-12    5803
    2020-04-19    5852
    2020-04-26    5901
    2020-05-03    5950
    2020-05-10    5999
    2020-05-17    6048
    2020-05-24    6097
    2020-05-31    6146
    2020-06-07    6195
    2020-06-14    6244
    2020-06-21    6293
    2020-06-28    6342
    2020-07-05    6391
    2020-07-12    6440
    2020-07-19    6489
    2020-07-26    6538
    2020-08-02    6587
    2020-08-09    6636
    2020-08-16    6685
    2020-08-23    6734
    2020-08-30    6783
    2020-09-06    6832
    2020-09-13    6881
    2020-09-20    6930
    2020-09-27    5979
    Freq: W-SUN, Length: 143, dtype: int32
    
    In [39]: sr.resample('A').sum()  # yearly sums
    Out[39]:
    2018-12-31     66430
    2019-12-31    199655
    2020-12-31    233415
    Freq: A-DEC, dtype: int32
    
    In [40]: sr.resample('M').mean()  # monthly means
    Out[40]:
    2018-01-31     15.0
    2018-02-28     44.5
    2018-03-31     74.0
    2018-04-30    104.5
    2018-05-31    135.0
    2018-06-30    165.5
    2018-07-31    196.0
    2018-08-31    227.0
    2018-09-30    257.5
    2018-10-31    288.0
    2018-11-30    318.5
    2018-12-31    349.0
    2019-01-31    380.0
    2019-02-28    409.5
    2019-03-31    439.0
    2019-04-30    469.5
    2019-05-31    500.0
    2019-06-30    530.5
    2019-07-31    561.0
    2019-08-31    592.0
    2019-09-30    622.5
    2019-10-31    653.0
    2019-11-30    683.5
    2019-12-31    714.0
    2020-01-31    745.0
    2020-02-29    775.0
    2020-03-31    805.0
    2020-04-30    835.5
    2020-05-31    866.0
    2020-06-30    896.5
    2020-07-31    927.0
    2020-08-31    958.0
    2020-09-30    986.5
    Freq: M, dtype: float64
    
    In [41]: sr.truncate(before='2019-11-12')  # drop everything before the date; slicing covers this so well that truncate is rarely needed
    Out[41]:
    2019-11-12    680
    2019-11-13    681
    2019-11-14    682
    2019-11-15    683
    2019-11-16    684
    2019-11-17    685
    2019-11-18    686
    2019-11-19    687
    2019-11-20    688
    2019-11-21    689
    2019-11-22    690
    2019-11-23    691
    2019-11-24    692
    2019-11-25    693
    2019-11-26    694
    2019-11-27    695
    2019-11-28    696
    2019-11-29    697
    2019-11-30    698
    2019-12-01    699
    2019-12-02    700
    2019-12-03    701
    2019-12-04    702
    2019-12-05    703
    2019-12-06    704
    2019-12-07    705
    2019-12-08    706
    2019-12-09    707
    2019-12-10    708
    2019-12-11    709
                 ...
    2020-08-28    970
    2020-08-29    971
    2020-08-30    972
    2020-08-31    973
    2020-09-01    974
    2020-09-02    975
    2020-09-03    976
    2020-09-04    977
    2020-09-05    978
    2020-09-06    979
    2020-09-07    980
    2020-09-08    981
    2020-09-09    982
    2020-09-10    983
    2020-09-11    984
    2020-09-12    985
    2020-09-13    986
    2020-09-14    987
    2020-09-15    988
    2020-09-16    989
    2020-09-17    990
    2020-09-18    991
    2020-09-19    992
    2020-09-20    993
    2020-09-21    994
    2020-09-22    995
    2020-09-23    996
    2020-09-24    997
    2020-09-25    998
    2020-09-26    999
    Freq: D, Length: 320, dtype: int32
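
      strftime() is named above but has no session. The sketch below formats the dates of a short series; the format strings are arbitrary examples, and the names as_text and first are illustrative.

```python
import numpy as np
import pandas as pd

sr = pd.Series(np.arange(3), index=pd.date_range('2018-01-01', periods=3))

# DatetimeIndex.strftime returns an Index of formatted strings
as_text = sr.index.strftime('%Y/%m/%d')

# A single Timestamp supports strftime as well
first = sr.index[0].strftime('%Y%m%d')

print(as_text.tolist())
print(first)
```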
    

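    pandas: file handling

      Reading: read_csv 

      A common data file format is CSV (values separated by a delimiter).

      pandas can load data from a file name, a URL, or a file object.

    1. read_csv: the default separator is a comma
    2. read_table: the default separator is a tab

      Parameters

    • sep: the separator; a regular expression such as '\s+' is allowed
    • index_col: the column to use as the index
    In [87]: df.to_csv('test.csv',header=True,index=True,na_rep='null',encoding='gbk',columns=['one','two'])  # write a file using the DataFrame's own method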
    
    In [88]: pd.read_csv('test.csv')
    Out[88]: 
      Unnamed: 0  one   two
    0          a  1.0   4.0
    1          b  2.0   5.0
    2          c  3.0   6.0
    3          d  4.0   7.0
    4          e  NaN  10.0
    5          f  NaN   NaN
    
    In [89]: pd.read_csv('test.csv',index_col=0)  # a column position can designate the row labels
    Out[89]: 
       one   two
    a  1.0   4.0
    b  2.0   5.0
    c  3.0   6.0
    d  4.0   7.0
    e  NaN  10.0
    f  NaN   NaN
    
    In [90]: pd.read_csv('test.csv',index_col='one')  # a column name can designate the row labels
    Out[90]: 
         Unnamed: 0   two
    one
     1.0          a   4.0
     2.0          b   5.0
     3.0          c   6.0
     4.0          d   7.0
    NaN           e  10.0
    NaN           f   NaN
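
      The regular-expression form of sep is described above without a session. As a hedged sketch, the example reads whitespace-aligned text from an in-memory buffer instead of a file; the text and the column name 'name' are made up for illustration.

```python
import io
import pandas as pd

# Columns separated by a varying number of spaces
text = (
    "name  one   two\n"
    "a     1.0   4.0\n"
    "b     2.0   5.0\n"
)

# r'\s+' matches any run of whitespace as a single separator
df = pd.read_csv(io.StringIO(text), sep=r'\s+', index_col='name')

print(df)
```

      io.StringIO stands in for a file object here, which is one of the input types read_csv accepts.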
    

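      If a date column is read in this way, one problem remains: even when it is used as the index, it is not a time object,
      just a string. How can it be converted into time objects?

    • parse_dates: which columns to parse as dates; a boolean or a list
    pd.read_csv('test.csv',index_col='date',parse_dates=True)  # parse the index as dates
    pd.read_csv('test.csv',index_col='date',parse_dates=['date'])  # parse this column as dates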
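
      The test.csv written earlier has no date column, so the two calls above cannot be run against it as is. The sketch below uses a small in-memory CSV with hypothetical date and price columns to show the difference parse_dates makes.

```python
import io
import pandas as pd

text = "date,price\n2018-01-01,10.5\n2018-01-02,11.0\n"

# Without parse_dates the index holds plain strings
raw = pd.read_csv(io.StringIO(text), index_col='date')

# With parse_dates=True the index is parsed into a DatetimeIndex
parsed = pd.read_csv(io.StringIO(text), index_col='date', parse_dates=True)

print(type(raw.index).__name__)
print(type(parsed.index).__name__)

# Date-string selection now works on the parsed frame
print(parsed.loc['2018-01'])
```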
    
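    •  header=None: the file has no header row
    • names: column names, passed as a list

      When a file has no header row, its first data row is used as the column names by default. To set the names explicitly: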

    In [106]: pd.read_csv('test.csv')
    Out[106]: 
       1.0   4.0
    0  2.0   5.0
    1  3.0   6.0
    2  4.0   7.0
    3  NaN  10.0
    4  NaN   NaN
    
    In [107]: pd.read_csv('test.csv',header=None)  # tells the parser that the data has no header row, so the first row is kept as data; columns are numbered from 0
    Out[107]: 
         0     1
    0  1.0   4.0
    1  2.0   5.0
    2  3.0   6.0
    3  4.0   7.0
    4  NaN  10.0
    5  NaN   NaN
    
    In [108]: pd.read_csv('test.csv',header=None,names=list('gh'))  # set the column names to g and h, passed as a list
    Out[108]: 
         g     h
    0  1.0   4.0
    1  2.0   5.0
    2  3.0   6.0
    3  4.0   7.0
    4  NaN  10.0
    5  NaN   NaN
    
    •  na_values: a value or string (or several) to treat as missing (NaN)
    • skiprows: rows to skip
    In [110]: pd.read_csv('test.csv',header=None,names=list('gh'),na_values='10')
    Out[110]: 
         g    h
    0  1.0  4.0
    1  2.0  5.0
    2  3.0  6.0
    3  4.0  7.0
    4  NaN  NaN
    5  NaN  NaN
    
    In [111]: pd.read_csv('test.csv',header=None,names=list('gh'),na_values='10',skiprows=[4,5])
    Out[111]: 
         g    h
    0  1.0  4.0
    1  2.0  5.0
    2  3.0  6.0
    3  4.0  7.0
    

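      Writing: the to_csv method

    • sep: the separator for the output file
    • na_rep: the string written for missing values; an empty string by default
    • header=False: do not write the header row
    • index=False: do not write the row index column
    • columns: the columns to write, passed as a list
    In [59]: df.to_csv('test3.csv',header=False,index=False,na_rep='null',encoding='gbk',columns=['年份','股票代码','股票价格'])  # columns: year, stock code, stock price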
    
    In [60]: pd.read_csv('test3.csv',encoding='gbk')
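
      To make the to_csv parameters concrete, here is a minimal round-trip sketch that writes to an in-memory buffer rather than a file. The frame, the ';' separator and the 'null' marker are illustrative choices, not taken from the session above.

```python
import io
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [1.0, np.nan], 'two': [4.0, 10.0]},
                  index=['a', 'e'])

# Write with a custom separator, a marker for NaN, and no index column
buffer = io.StringIO()
df.to_csv(buffer, sep=';', na_rep='null', index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Read it back with the matching separator and missing-value marker
buffer.seek(0)
restored = pd.read_csv(buffer, sep=';', na_values='null')
print(restored)
```

      Because index=False was used when writing, the restored frame gets a fresh default index (0, 1) instead of the original labels.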
    

       Other file types supported by pandas: JSON, XML, HTML, databases, pickle, Excel...

    In [68]: df.to_html('test.html',header=False,index=False,na_rep='null',columns=['年份','股票代码','股票价格'])
    
    In [5]: pd.read_html('test.html',encoding='gbk')  # reading some of these file types requires additional packages
    
  • Original article: https://www.cnblogs.com/xinsiwei18/p/9811964.html