zoukankan      html  css  js  c++  java
  • pandas

    numpy处理数值,但是pandas除了处理数值之外(基于numpy),还能处理其他类型数据。
    
    pandas常用数据类型
        Series一维,带标签数组
        DataFrame二维,Series容器

    Series创建

    In [97]: import pandas as pd
    
    In [98]: pd.Series([1,2,3,4,5])         # 第一列对象键(index, 索引),第二列对象值(values)
    Out[98]:
    0    1
    1    2
    2    3
    3    4
    4    5
    dtype: int64
    
    In [99]: t = pd.Series([1,2,3,4,5])
    
    In [100]: type(t)
    Out[100]: pandas.core.series.Series
    
    In [101]: t2 = pd.Series([1,2,3,4],index=list('abcd'))      # index指定索引
    
    In [102]: t2
    Out[102]:
    a    1
    b    2
    c    3
    d    4
    dtype: int64
    
    In [105]: pd.Series(np.arange(2))
    Out[105]:
    0    0
    1    1
    dtype: int64
    
    In [107]: t3 = pd.Series({'name':'xyp', 'age':18})
    
    In [108]: t3
    Out[108]:
    name    xyp
    age      18
    dtype: object
    
    In [109]: t3.dtype
    Out[109]: dtype('O')    # o代表object
    
    In [110]: t2.dtype
    Out[110]: dtype('int64')
    
    In [111]: t2.astype(float)
    Out[111]:
    a    1.0
    b    2.0
    c    3.0
    d    4.0
    dtype: float64
    

    Series切片和索引

    In [118]: t3 = pd.Series({'name':'xyp','age':18,'tel':10086})
    
    In [119]: t3
    Out[119]:
    name      xyp
    age        18
    tel     10086
    dtype: object
    
    In [120]: t3[0]
    Out[120]: 'xyp'
    
    In [121]: t3['name']
    Out[121]: 'xyp'
    
    In [122]: t3[[1,2]]
    Out[122]:
    age       18
    tel    10086
    dtype: object
    
    In [123]: t3[[0,2]]
    Out[123]:
    name      xyp
    tel     10086
    dtype: object
    
    In [125]: t3[['age','tel']]
    Out[125]:
    age       18
    tel    10086
    dtype: object
    
    In [128]: t3
    Out[128]:
    name      xyp
    age        18
    tel     10086
    dtype: object
    
    In [129]: t3.index
    Out[129]: Index(['name', 'age', 'tel'], dtype='object')
    
    In [130]: for i in t3.index:
         ...:     print(i)
         ...:
    name
    age
    tel
    
    In [131]: type(t3.index)
    Out[131]: pandas.core.indexes.base.Index
    
    In [132]: len(t3.index)
    Out[132]: 3
    
    In [133]: list(t3.index)
    Out[133]: ['name', 'age', 'tel']
    
    In [136]: s = pd.Series(range(5))
    
    In [137]: s.where(s>0)
    Out[137]:
    0    NaN
    1    1.0
    2    2.0
    3    3.0
    4    4.0
    dtype: float64
    
    In [139]: s.where(s>1, 10)  # s元素 = s元素 s元素>1 else 10,和numpy用法相反
    Out[139]:
    0    10
    1    10
    2     2
    3     3
    4     4
    dtype: int64
    
    In [126]: t
    Out[126]:
    0    1
    1    2
    2    3
    3    4
    4    5
    dtype: int64
    
    In [127]: t[t>3]
    Out[127]:
    3    4
    4    5
    dtype: int64
    

    pandas读取外部文件

    pd.read_csv
    pd.read_excel
    pd.read_sql
    mongodb数据和mysql不同,需先连接mongodb取出数据然后放入pd.Series(data)中处理
    ......
    

    DataFrame创建

    In [4]: pd.DataFrame(np.arange(12).reshape(3,4))            # 第一行为行索引,第一列为列索引
    Out[4]:
       0  1   2   3
    0  0  1   2   3
    1  4  5   6   7
    2  8  9  10  11
    
    In [6]: pd.DataFrame(np.arange(12).reshape(3,4),index=list('abc'),columns=list('WXYZ'))
    Out[6]:
       W  X   Y   Z
    a  0  1   2   3
    b  4  5   6   7
    c  8  9  10  11
    
    In [7]: t1 = pd.DataFrame(np.arange(12).reshape(3,4),index=list('abc'),columns=list('WXYZ'))
    
    In [8]: t1
    Out[8]:
       W  X   Y   Z
    a  0  1   2   3
    b  4  5   6   7
    c  8  9  10  11
    
    In [9]: d1 = {'name':['xyp','oynn'],'age':[18,20],'tel':[10086,10087]}
    
    In [10]: pd.DataFrame(d1)
    Out[10]:
       name  age    tel
    0   xyp   18  10086
    1  oynn   20  10087
    
    In [11]: t1 = pd.DataFrame(d1)
    
    In [12]: type(t1)
    Out[12]: pandas.core.frame.DataFrame
    
    In [13]: d2 = [{'name':'xyp','age':18,'tel':10086},{'name':'oynn','age':20},{'age':16,'tel':10087}]
    
    In [14]: t2 = pd.DataFrame(d2)
    
    In [15]: t2
    Out[15]:
       name  age      tel
    0   xyp   18  10086.0
    1  oynn   20      NaN
    2   NaN   16  10087.0
    

    DataFrame基础属性

    In [15]: t2
    Out[15]:
       name  age      tel
    0   xyp   18  10086.0
    1  oynn   20      NaN
    2   NaN   16  10087.0
    
    In [16]: t2.index
    Out[16]: RangeIndex(start=0, stop=3, step=1)
    
    In [17]: t2.columns
    Out[17]: Index(['name', 'age', 'tel'], dtype='object')
    
    In [18]: t2.values
    Out[18]:
    array([['xyp', 18, 10086.0],
           ['oynn', 20, nan],
           [nan, 16, 10087.0]], dtype=object)
    
    In [19]: t2.shape
    Out[19]: (3, 3)
    
    In [20]: t2.dtypes
    Out[20]:
    name     object
    age       int64
    tel     float64
    dtype: object
    
    In [21]: t2.ndim        # t2维度
    Out[21]: 2
    
    In [22]: t2.head(2)     # 显示头部几行,默认5行
    Out[22]:
       name  age      tel
    0   xyp   18  10086.0
    1  oynn   20      NaN
    
    In [23]: t2.tail(2)     # 显示末尾几行,默认5行
    Out[23]:
       name  age      tel
    1  oynn   20      NaN
    2   NaN   16  10087.0
    
    In [24]: t2.info()      # 相关信息概览:行数,列数,列索引,列非空值个数,列类型,内存占用
    Out[24]:
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 3 entries, 0 to 2               # 3行,索引0 to 2
    Data columns (total 3 columns):
    name    2 non-null object
    age     3 non-null int64
    tel     2 non-null float64
    dtypes: float64(1), int64(1), object(1)
    memory usage: 200.0+ bytes                  # 内存
    
    In [25]: t2.describe()      # 快速综合统计结果
    Out[25]:
            age           tel
    count   3.0      2.000000       # 计数
    mean   18.0  10086.500000       # 均值
    std     2.0      0.707107       # 标准差
    min    16.0  10086.000000       # 最小值
    25%    17.0  10086.250000       # 四分之一中位数
    50%    18.0  10086.500000       # 中位数
    75%    19.0  10086.750000       # 四分之三中位数
    max    20.0  10087.000000       # 最大值
    

    DataFrame排序方法

    In [27]: t2
    Out[27]:
       name  age      tel
    0   xyp   18  10086.0
    1  oynn   20      NaN
    2   NaN   16  10087.0
    
    In [30]: t2.sort_values(by='age')                                                                           
    Out[30]: 
       name  age      tel
    2   NaN   16  10087.0
    0   xyp   18  10086.0
    1  oynn   20      NaN
    
    In [31]: t2.sort_values(by='age',ascending=False)                                                           
    Out[31]: 
       name  age      tel
    1  oynn   20      NaN
    0   xyp   18  10086.0
    2   NaN   16  10087.0
    
    In [32]: df = t2.sort_values(by='age',ascending=False)                                                      
    
    In [33]: df.head(2)                                                                                         
    Out[33]: 
       name  age      tel
    1  oynn   20      NaN
    0   xyp   18  10086.0
    

      

    DataFrame的索引

    In [33]: df
    Out[33]:
       name  age      tel
    1  oynn   20      NaN
    0   xyp   18  10086.0
    2   NaN   16  10087.0
    
    In [34]: df[:2]
    Out[34]:
       name  age      tel
    1  oynn   20      NaN
    0   xyp   18  10086.0
    
    In [35]: df[:2]['age']
    Out[35]:
    1    20
    0    18
    Name: age, dtype: int64
    
    In [36]: df['name']
    Out[36]:
    1    oynn
    0     xyp
    2     NaN
    Name: name, dtype: object
    
    In [37]: type(df['name'])
    Out[37]: pandas.core.series.Series
     

    loc通过标签索引获取数据

    In [48]: t2
    Out[48]:
       name  age      tel
    0   xyp   18  10086.0
    1  oynn   20      NaN
    2   NaN   16  10087.0
    
    In [49]: t2.loc[0,'name']       # loc[a,b] a为行索引,b为列索引
    Out[49]: 'xyp'
    
    In [50]: t2.loc[1,'age']
    Out[50]: 20
    
    In [51]: type(t2.loc[1,'age'])
    Out[51]: numpy.int64
    
    In [55]: t2.loc[2]
    Out[55]:
    name      NaN
    age        16
    tel     10087
    Name: 2, dtype: object
    
    In [56]: t2.loc[2,:]
    Out[56]:
    name      NaN
    age        16
    tel     10087
    Name: 2, dtype: object
    
    In [57]: t2.loc[:,'age']
    Out[57]:
    0    18
    1    20
    2    16
    Name: age, dtype: int64
    
    In [58]: t2.loc[[0,2],:]
    Out[58]:
      name  age      tel
    0  xyp   18  10086.0
    2  NaN   16  10087.0
    
    In [59]: t2.loc[:,['name','age']]
    Out[59]:
       name  age
    0   xyp   18
    1  oynn   20
    2   NaN   16
    
    In [62]: t2.loc[[1,2],['name','tel']]
    Out[62]:
       name      tel
    1  oynn      NaN
    2   NaN  10087.0
    
    In [63]: type(t2.loc[[1,2],['name','tel']])
    Out[63]: pandas.core.frame.DataFrame
    
    In [64]: t2.loc[0:2,['name','tel']]          # loc取索引,能取到索引最后一个数据
    Out[64]:
       name      tel
    0   xyp  10086.0
    1  oynn      NaN
    2   NaN  10087.0
    
    In [64]: t2.loc[0:2,'name':'tel']            # loc取索引,能取到索引最后一个数据
    Out[64]:
       name  age      tel
    0   xyp   18  10086.0
    1  oynn   20      NaN
    2   NaN   16  10087.0
    

      

    iloc通过位置索引获取数据

    In [66]: t2.iloc[0:2,[0,1]]
    Out[66]:
       name  age
    0   xyp   18
    1  oynn   20
    
    In [67]: t2.iloc[0:2,0:2]       # iloc取索引不包括最后一个数据
    Out[67]:
       name  age
    0   xyp   18
    1  oynn   20
    
    In [73]: t2.iloc[:,1] = 18
    
    In [74]: t2
    Out[74]:
       name  age      tel
    0   xyp   18  10086.0
    1  oynn   18      NaN
    2   NaN   18  10087.0
    

      

    布尔索引

    In [79]: t2
    Out[79]:
       name  age      tel
    0   xyp   18  10086.0
    1  oynn   20      NaN
    2   NaN   22  10087.0
    
    In [81]: t2[t2['age']>18]
    Out[81]:
       name  age      tel
    1  oynn   20      NaN
    2   NaN   22  10087.0
    
    In [84]: t2[(t2['age']>18) & (t2['age']<22)]
    Out[84]:
       name  age  tel
    1  oynn   20  NaN
    
    In [85]: t2[(t2['age']>18) | (t2['age']<22)]
    Out[85]:
       name  age      tel
    0   xyp   18  10086.0
    1  oynn   20      NaN
    2   NaN   22  10087.0
    

      

    字符串方法

    In [95]: t2
    Out[95]:
       name  age          tel
    0   xyp   18  10086,10087
    1  oynn   20          NaN
    2   NaN   22  10086,10087
    
    In [96]: t2['tel'].str.split(',')
    Out[96]:
    0    [10086, 10087]
    1               NaN
    2    [10086, 10087]
    Name: tel, dtype: object
    
    ....
    

      

    pandas缺失数据处理

    In [99]: t2
    Out[99]:
       name  age          tel
    0   xyp   18  10086,10087
    1  oynn   20          NaN
    2   NaN   22  10086,10087
    
    In [101]: pd.isnull(t2)         # 判断数据是否有NaN
    Out[101]:
        name    age    tel
    0  False  False  False
    1  False  False   True
    2   True  False  False
    
    In [102]: pd.notnull(t2)        # 判断数据是否有NaN
    Out[102]:
        name   age    tel
    0   True  True   True
    1   True  True  False
    2  False  True   True
    
    
    1. 删除NaN所在的行列
    
    In [104]: t2.dropna(axis=0)     # 默认为dropna(axis=0, how='any', inplace=False), how='all'时为数据均为NaN才会删除,inplace=True时直接修改t2数组,不需要重新赋值就能改变t2数组
    Out[104]:
      name  age          tel
    0  xyp   18  10086,10087
    
    In [106]: t2.dropna(axis=0, how='all')
    Out[106]:
       name  age          tel
    0   xyp   18  10086,10087
    1  oynn   20          NaN
    2   NaN   22  10086,10087
    
    
    2. 填充数据
    
    In [107]: t2
    Out[107]:
       name  age          tel
    0   xyp   18  10086,10087
    1  oynn   20          NaN
    2   NaN   22  10086,10087
    
    In [111]: t2.fillna(t2.mean())      # t2.mean() nan不参与计算均值。name,tel不是数值所以填充失败。
    Out[111]:
       name  age          tel
    0   xyp   18  10086,10087
    1  oynn   20          NaN
    2   NaN   22  10086,10087
    
    In [112]: t2.fillna(0)
    Out[112]:
       name  age          tel
    0   xyp   18  10086,10087
    1  oynn   20            0
    2     0   22  10086,10087
    

      

    import pandas as pd
    
    file_path = 'IMDB-Movie-Data.csv'
    
    df = pd.read_csv(file_path)
    
    # 获取电影平均评分
    print(df['Rating'].mean())  # 6.7232
    # 导演人数
    print(len(set(df['Director'].tolist())))    # 644
    print(len(df['Director'].unique()))         # 644
    # 获取演员人数
    temp_actors_list = df['Actors'].str.split(',').tolist()
    actors_list = [j for i in temp_actors_list for j in i]      # 双重循环推导式
    actors_num = len(set(actors_list))
    print(actors_num)   # 2394
    获取电影平均评分

    数据合并

    # 数组合并join,行合并
    
    In [1]: import pandas as pd
    
    In [2]: import numpy as np
    
    In [3]: df1 = pd.DataFrame(np.ones((2,4)),index=['A','B'],columns=list('abcd'))
    
    In [4]: df1
    Out[4]:
         a    b    c    d
    A  1.0  1.0  1.0  1.0
    B  1.0  1.0  1.0  1.0
    
    In [5]: df2 = pd.DataFrame(np.zeros((3,3)),index=['A','B','C'],columns=list('xyz'))
    
    In [6]: df2
    Out[6]:
         x    y    z
    A  0.0  0.0  0.0
    B  0.0  0.0  0.0
    C  0.0  0.0  0.0
    
    In [7]: df1.join(df2)
    Out[7]:
         a    b    c    d    x    y    z
    A  1.0  1.0  1.0  1.0  0.0  0.0  0.0
    B  1.0  1.0  1.0  1.0  0.0  0.0  0.0
    
    In [8]: df2.join(df1)
    Out[8]:
         x    y    z    a    b    c    d
    A  0.0  0.0  0.0  1.0  1.0  1.0  1.0
    B  0.0  0.0  0.0  1.0  1.0  1.0  1.0
    C  0.0  0.0  0.0  NaN  NaN  NaN  NaN
    
    
    # 数组合并merge,列合并
    
    In [12]: df1
    Out[12]:
         a    b    c    d
    A  1.0  1.0  1.0  1.0
    B  1.0  1.0  1.0  1.0
    
    In [14]: df3 = pd.DataFrame(np.arange(9).reshape((3,3)),columns=list('fax'))
    
    In [15]: df3
    Out[15]:
       f  a  x
    0  0  1  2
    1  3  4  5
    2  6  7  8
    
    In [16]: df1.merge(df3,on='a')      # on交集合并,按照a列合并,df1中a列坐标(A,a)和(B,a),和df3中a列坐标(0,a)相等,df1中共2行,合并成2行
    Out[16]:
         a    b    c    d  f  x
    0  1.0  1.0  1.0  1.0  0  2
    1  1.0  1.0  1.0  1.0  0  2
    
    In [17]: df1.loc['A','a'] = 100
    
    In [18]: df1
    Out[18]:
           a    b    c    d
    A  100.0  1.0  1.0  1.0
    B    1.0  1.0  1.0  1.0
    
    In [19]: df1.merge(df3,on='a')      # on交集合并,按照a列合并,df1中a列坐标(B,a),和df3中a列坐标(0,a)相等,df1中共1行,合并成1行
    Out[19]:
         a    b    c    d  f  x
    0  1.0  1.0  1.0  1.0  0  2
    
    In [20]: df1.merge(df3,on='a',how='inner')      # inner内连接,交集
    Out[20]:
         a    b    c    d  f  x
    0  1.0  1.0  1.0  1.0  0  2
    
    In [21]: df1.merge(df3,on='a',how='outer')      # outer外连接,并集
    Out[21]:
           a    b    c    d    f    x
    0  100.0  1.0  1.0  1.0  NaN  NaN
    1    1.0  1.0  1.0  1.0  0.0  2.0
    2    4.0  NaN  NaN  NaN  3.0  5.0
    3    7.0  NaN  NaN  NaN  6.0  8.0
    

      

    分组聚合

    df.groupby(by='')   # by值可以为字符串,单个字段分组;by值为[df[''], df[''], ...]多个字段分组
    ......
    

      

    索引和复合索引

    In [5]: df1
    Out[5]:
         a    b    c    d
    A  1.0  1.0  1.0  1.0
    B  1.0  1.0  1.0  1.0
    
    In [6]: df1.index                       # 获取索引
    Out[6]: Index(['A', 'B'], dtype='object')
    
    In [8]: df1.index = ['a', 'b']          # 指定索引
    
    In [9]: df1
    Out[9]:
         a    b    c    d
    a  1.0  1.0  1.0  1.0
    b  1.0  1.0  1.0  1.0
    
    In [10]: df1.index
    Out[10]: Index(['a', 'b'], dtype='object')
    
    In [11]: df1.reindex(['a','f'])         # 重新设置索引,用的比较少
    Out[11]:
         a    b    c    d
    a  1.0  1.0  1.0  1.0
    f  NaN  NaN  NaN  NaN
    
    In [12]: df1
    Out[12]:
         a    b    c    d
    a  1.0  1.0  1.0  1.0
    b  1.0  1.0  1.0  1.0
    
    In [13]: df1.set_index('a')             # 指定某一列作为索引
    Out[13]:
           b    c    d
    a
    1.0  1.0  1.0  1.0
    1.0  1.0  1.0  1.0
    
    In [14]: df1.set_index('a',drop=False)     # drop=False不删除指定索引列
    Out[14]:
           a    b    c    d
    a
    1.0  1.0  1.0  1.0  1.0
    1.0  1.0  1.0  1.0  1.0
    
    In [31]: df1
    Out[31]:
           a    b    c    d
    A  100.0  1.0  1.0  1.0
    B    1.0  1.0  1.0  1.0
    
    In [32]: df1['d'].unique()             # 对某一列来取值,且值不重复
    Out[32]: array([1.])
    
    In [33]: df1['a'].unique()
    Out[33]: array([100.,   1.])
    
    In [36]: df1.set_index('b').index.unique()
    Out[36]: Float64Index([1.0], dtype='float64', name='b')
    
    In [37]: len(df1.set_index('b').index.unique())
    Out[37]: 1
    
    In [38]: len(df1.set_index('b').index)
    Out[38]: 2
    
    In [39]: list(df1.set_index('b').index)
    Out[39]: [1.0, 1.0]
    

      

    时间序列 

    # 时间序列
    
    In [40]: pd.date_range(start='20190831',end='20190930',freq='D')
    Out[40]:
    DatetimeIndex(['2019-08-31', '2019-09-01', '2019-09-02', '2019-09-03',
                   '2019-09-04', '2019-09-05', '2019-09-06', '2019-09-07',
                   '2019-09-08', '2019-09-09', '2019-09-10', '2019-09-11',
                   '2019-09-12', '2019-09-13', '2019-09-14', '2019-09-15',
                   '2019-09-16', '2019-09-17', '2019-09-18', '2019-09-19',
                   '2019-09-20', '2019-09-21', '2019-09-22', '2019-09-23',
                   '2019-09-24', '2019-09-25', '2019-09-26', '2019-09-27',
                   '2019-09-28', '2019-09-29', '2019-09-30'],
                  dtype='datetime64[ns]', freq='D')
    
    In [41]: pd.date_range(start='20190831',end='20190930',freq='10D')
    Out[41]: DatetimeIndex(['2019-08-31', '2019-09-10', '2019-09-20', '2019-09-30'], dtype='datetime64[ns]', freq='10D')
    
    In [42]: pd.date_range(start='20190831',periods=10,freq='D')
    Out[42]:
    DatetimeIndex(['2019-08-31', '2019-09-01', '2019-09-02', '2019-09-03',
                   '2019-09-04', '2019-09-05', '2019-09-06', '2019-09-07',
                   '2019-09-08', '2019-09-09'],
                  dtype='datetime64[ns]', freq='D')
    
    In [43]: pd.date_range(start='20190831',periods=10,freq='M')
    Out[43]:
    DatetimeIndex(['2019-08-31', '2019-09-30', '2019-10-31', '2019-11-30',
                   '2019-12-31', '2020-01-31', '2020-02-29', '2020-03-31',
                   '2020-04-30', '2020-05-31'],
                  dtype='datetime64[ns]', freq='M')
    
    
    pd.to_datetime(df['timeStamp'],format='')
    
    
    # resample重采样,是对原样本重新处理的一个方法,是一个对常规时间序列数据重新采样和频率转换的便捷的方法。
    降采样:高频数据到低频数据
    升采样:低频数据到高频数据
    
    df.resample('D')
    
    
    # PeriodIndex重组时间序列,主要将数据中的分离的时间字段,重组为时间序列,并指定为index
    period = pd.PeriodIndex(year=df['year'],month=df['month'],day=df['day'],hour=df['hour'],freq='H')
    

      

      

  • 相关阅读:
    web框架基础
    前端基础之DOM和jQuery
    前端基础之JavaScript
    前端基础之CSS
    福州大学W班-助教总结
    福州大学W班-个人最终成绩统计
    福州大学W班-Beta冲刺评分
    福州大学W班-alpha冲刺评分
    福州大学W班-团队作业-随堂小测(同学录)成绩
    福州大学W班-需求分析评分排名
  • 原文地址:https://www.cnblogs.com/xuyaping/p/13572781.html
Copyright © 2011-2022 走看看