zoukankan      html  css  js  c++  java
  • python之pandas简介

    参考自:https://www.php.cn/python-tutorials-427622.html

    1. 安装第三方库
    pip install numpy
    pip install pandas
    pip install xlrd    # 操作excel时会用到
    
    1. 构造数据 DataFrame

    第一个参数:

    ​ 字典:作为每列数据,字典的键作为列名,字典的值作为列的数据

    ​ 嵌套列表:[[1,2,3],[4,5,6]],则每个内嵌的列表作为每列值(需要搭配第二个参数来作为列名)

    第二个参数:

    ​ 列表:列表的每个元素作为列名

    import numpy as np
    import pandas as pd
    
    def println(*args):
        print(*args,end='
    ==================================结果分割线==================================
    ')
    
    df = pd.DataFrame(
        {
            "id":[1001,1002,1003,1004,1005,1006],
            "date":pd.date_range('20130102', periods=6),
            "city":['Beijing ', 'SH', ' guangzhou', 'shanghai', 'sh ', 'BEIJING '],
            "age":[23,44,54,32,34,32],
            "category":['100-A','100-B','110-A','110-C','210-A','130-F'],
            "price":[1200,np.nan,2133,5433,np.nan,4432] 		# np.nan是空数据的意思
        },
        columns =['id','date','city','category','age','price']) # 此行的列表可以不写,因为上面的字典的键已经当做列名了,如果写了,则以这行的数据作为列名,字典的键就被无视了
    
    1. 查看数据类型,基础信息,是否为空
    println(df,type(df))       # 打印数据
    println(df.shape)          # 打印数据的规模:(几行,几列)
    println(df.info())         # 显示数据信息,类型等。
    println(df.dtypes)         # 显示数据类型
    println(df['id'].dtype)    # 显示 id列 的类型
    println(df.isnull())       # 实现是否为空
    println(df['price'].isnull()) # 显示 price列 是否有空数据
    
    D:Softwarepython37.32python.exe C:/Users/wang/Desktop/pan.py
         id       date        city category  age   price
    0  1001 2013-01-02    Beijing     100-A   23  1200.0
    1  1002 2013-01-03          SH    100-B   44     NaN
    2  1003 2013-01-04   guangzhou    110-A   54  2133.0
    3  1004 2013-01-05    shanghai    110-C   32  5433.0
    4  1005 2013-01-06         sh     210-A   34     NaN
    5  1006 2013-01-07    BEIJING     130-F   32  4432.0 <class 'pandas.core.frame.DataFrame'>
    ==================================结果分割线==================================
    (6, 6)
    ==================================结果分割线==================================
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 6 entries, 0 to 5
    Data columns (total 6 columns):
     #   Column    Non-Null Count  Dtype         
    ---  ------    --------------  -----         
     0   id        6 non-null      int64         
     1   date      6 non-null      datetime64[ns]
     2   city      6 non-null      object        
     3   category  6 non-null      object        
     4   age       6 non-null      int64         
     5   price     4 non-null      float64       
    dtypes: datetime64[ns](1), float64(1), int64(2), object(2)
    memory usage: 304.0+ bytes
    None
    ==================================结果分割线==================================
    id                   int64
    date        datetime64[ns]
    city                object
    category            object
    age                  int64
    price              float64
    dtype: object
    ==================================结果分割线==================================
    int64
    ==================================结果分割线==================================
          id   date   city  category    age  price
    0  False  False  False     False  False  False
    1  False  False  False     False  False   True
    2  False  False  False     False  False  False
    3  False  False  False     False  False  False
    4  False  False  False     False  False   True
    5  False  False  False     False  False  False
    ==================================结果分割线==================================
    0    False
    1     True
    2    False
    3    False
    4     True
    5    False
    Name: price, dtype: bool
    
    1. 数据去重,数据,列名,前10行,后10行
    println(df['id'].unique()) # 获取去重后的id列的数据
    println(df.values) # 全表的数据
    println(df.columns) # 所有的列名
    println(df.head()) 	#默认前10行数据
    println(df.tail()) 	#默认后10行数据
    
    D:Softwarepython37.32python.exe C:/Users/.wang/Desktop/pan.py
    [1001 1002 1003 1004 1005 1006]
    ==================================结果分割线==================================
    [[1001 Timestamp('2013-01-02 00:00:00') 'Beijing ' '100-A' 23 1200.0]
     [1002 Timestamp('2013-01-03 00:00:00') 'SH' '100-B' 44 nan]
     [1003 Timestamp('2013-01-04 00:00:00') ' guangzhou' '110-A' 54 2133.0]
     [1004 Timestamp('2013-01-05 00:00:00') 'shanghai' '110-C' 32 5433.0]
     [1005 Timestamp('2013-01-06 00:00:00') 'sh ' '210-A' 34 nan]
     [1006 Timestamp('2013-01-07 00:00:00') 'BEIJING ' '130-F' 32 4432.0]]
    ==================================结果分割线==================================
    Index(['id', 'date', 'city', 'category', 'age', 'price'], dtype='object')
    ==================================结果分割线==================================
         id       date        city category  age   price
    0  1001 2013-01-02    Beijing     100-A   23  1200.0
    1  1002 2013-01-03          SH    100-B   44     NaN
    2  1003 2013-01-04   guangzhou    110-A   54  2133.0
    3  1004 2013-01-05    shanghai    110-C   32  5433.0
    4  1005 2013-01-06         sh     210-A   34     NaN
    ==================================结果分割线==================================
         id       date        city category  age   price
    1  1002 2013-01-03          SH    100-B   44     NaN
    2  1003 2013-01-04   guangzhou    110-A   54  2133.0
    3  1004 2013-01-05    shanghai    110-C   32  5433.0
    4  1005 2013-01-06         sh     210-A   34     NaN
    5  1006 2013-01-07    BEIJING     130-F   32  4432.0
    
    
    1. 填充空值
    df.fillna(value=0,inplace=True)  # inplace=True,会直接改写原表,不写这个参数,则不会更改原表
    println(df)
    
         id       date        city category  age   price
    0  1001 2013-01-02    Beijing     100-A   23  1200.0
    1  1002 2013-01-03          SH    100-B   44     0.0
    2  1003 2013-01-04   guangzhou    110-A   54  2133.0
    3  1004 2013-01-05    shanghai    110-C   32  5433.0
    4  1005 2013-01-06         sh     210-A   34     0.0
    5  1006 2013-01-07    BEIJING     130-F   32  4432.0
    
    
    1. 获取一列数据的平均值
    println(df['price'].mean())
    
    3299.5
    ==================================结果分割线==================================
    
    1. 数据清洗:删除多余空格,大小写转换
    df['city']=df['city'].map(str.strip)
    println(df)
    
    df['city'] = df['city'].str.lower()
    println(df['city'])
    
    D:Softwarepython37.32python.exe C:/Users/wang/Desktop/pan.py
         id       date       city category  age   price
    0  1001 2013-01-02    Beijing    100-A   23  1200.0
    1  1002 2013-01-03         SH    100-B   44     NaN
    2  1003 2013-01-04  guangzhou    110-A   54  2133.0
    3  1004 2013-01-05   shanghai    110-C   32  5433.0
    4  1005 2013-01-06         sh    210-A   34     NaN
    5  1006 2013-01-07    BEIJING    130-F   32  4432.0
    ==================================结果分割线==================================
    0      beijing
    1           sh
    2    guangzhou
    3     shanghai
    4           sh
    5      beijing
    Name: city, dtype: object
    ==================================结果分割线==================================
    
    Process finished with exit code 0
    
    1. 数据类型转换(需要先将此列所有的空值填充一个数才行)
    df['price'].fillna(value = 0,inplace=True)
    println(df['price'].astype('int'))
    
    D:Softwarepython37.32python.exe C:/Users/wang/Desktop/pan.py
    0    1200
    1       0
    2    2133
    3    5433
    4       0
    5    4432
    Name: price, dtype: int32
    ==================================结果分割线==================================
    
    1. 重命名列名
    df.rename(columns={'category': 'category-size'},inplace=True)
    println(df)
    
    D:Softwarepython37.32python.exe C:/Users/wang/Desktop/pan.py
         id       date        city category-size  age   price
    0  1001 2013-01-02    Beijing          100-A   23  1200.0
    1  1002 2013-01-03          SH         100-B   44     NaN
    2  1003 2013-01-04   guangzhou         110-A   54  2133.0
    3  1004 2013-01-05    shanghai         110-C   32  5433.0
    4  1005 2013-01-06         sh          210-A   34     NaN
    5  1006 2013-01-07    BEIJING          130-F   32  4432.0
    ==================================结果分割线==================================
    
    Process finished with exit code 0
    
    1. 删除一列数据的重复项(去重)
    df['city'].drop_duplicates()
    df['city'].drop_duplicates(keep='last') # keep='last','first','False'; 针对重复项,分别为:保留最后一个,保留第一个,都不保留
    
    1. 替换数据
    println(df['city'].replace('sh ', 'shanghai')) # 将 'sh ' 替换为 'shanghai'
    
    D:Softwarepython37.32python.exe C:/Users/wang/Desktop/pan.py
    0      Beijing 
    1            SH
    2     guangzhou
    3      shanghai
    4      shanghai
    5      BEIJING 
    Name: city, dtype: object
    ==================================结果分割线==================================
    
    Process finished with exit code 0
    
    
    1. 两表合并
    df1=pd.DataFrame(
        {
            "id":[1001,1002,1003,1004,1005,1006,1007,1008],
            "gender":['male','female','male','female','male','female','male','female'],
            "pay":['Y','N','Y','Y','N','Y','N','Y',],
            "m-point":[10,12,20,40,40,40,30,20]
        })
    
    df_inner = pd.merge(df,df1,how='inner')  # 匹配合并,交集,等同于sql的内连接:select * from df inner join df1 on df.id = df1.id
    println(df_inner)
    
    df_left=pd.merge(df,df1,how='left') # 等同于sql的左连接
    println(df_left)
    
    df_right=pd.merge(df,df1,how='right') # 右连接
    println(df_right)
    
    df_outer=pd.merge(df,df1,how='outer')  #并集,等同于sql的 full outer join
    println(df_outer)
    
    D:Softwarepython37.32python.exe C:/Users/wang/Desktop/pan.py
         id       date        city category  age   price  gender pay  m-point
    0  1001 2013-01-02    Beijing     100-A   23  1200.0    male   Y       10
    1  1002 2013-01-03          SH    100-B   44     NaN  female   N       12
    2  1003 2013-01-04   guangzhou    110-A   54  2133.0    male   Y       20
    3  1004 2013-01-05    shanghai    110-C   32  5433.0  female   Y       40
    4  1005 2013-01-06         sh     210-A   34     NaN    male   N       40
    5  1006 2013-01-07    BEIJING     130-F   32  4432.0  female   Y       40
    ==================================结果分割线==================================
         id       date        city category  age   price  gender pay  m-point
    0  1001 2013-01-02    Beijing     100-A   23  1200.0    male   Y       10
    1  1002 2013-01-03          SH    100-B   44     NaN  female   N       12
    2  1003 2013-01-04   guangzhou    110-A   54  2133.0    male   Y       20
    3  1004 2013-01-05    shanghai    110-C   32  5433.0  female   Y       40
    4  1005 2013-01-06         sh     210-A   34     NaN    male   N       40
    5  1006 2013-01-07    BEIJING     130-F   32  4432.0  female   Y       40
    ==================================结果分割线==================================
         id       date        city category   age   price  gender pay  m-point
    0  1001 2013-01-02    Beijing     100-A  23.0  1200.0    male   Y       10
    1  1002 2013-01-03          SH    100-B  44.0     NaN  female   N       12
    2  1003 2013-01-04   guangzhou    110-A  54.0  2133.0    male   Y       20
    3  1004 2013-01-05    shanghai    110-C  32.0  5433.0  female   Y       40
    4  1005 2013-01-06         sh     210-A  34.0     NaN    male   N       40
    5  1006 2013-01-07    BEIJING     130-F  32.0  4432.0  female   Y       40
    6  1007        NaT         NaN      NaN   NaN     NaN    male   N       30
    7  1008        NaT         NaN      NaN   NaN     NaN  female   Y       20
    ==================================结果分割线==================================
         id       date        city category   age   price  gender pay  m-point
    0  1001 2013-01-02    Beijing     100-A  23.0  1200.0    male   Y       10
    1  1002 2013-01-03          SH    100-B  44.0     NaN  female   N       12
    2  1003 2013-01-04   guangzhou    110-A  54.0  2133.0    male   Y       20
    3  1004 2013-01-05    shanghai    110-C  32.0  5433.0  female   Y       40
    4  1005 2013-01-06         sh     210-A  34.0     NaN    male   N       40
    5  1006 2013-01-07    BEIJING     130-F  32.0  4432.0  female   Y       40
    6  1007        NaT         NaN      NaN   NaN     NaN    male   N       30
    7  1008        NaT         NaN      NaN   NaN     NaN  female   Y       20
    ==================================结果分割线==================================
    
    Process finished with exit code 0
    
    
    1. 设置索引,并根据索引排序
    df_inner.set_index('id')    #设置id为索引
    println(df_inner.sort_index(ascending=False)) # 降序排序
    
    D:Softwarepython37.32python.exe C:/Users/wang/Desktop/pan.py
         id       date        city category  age   price  gender pay  m-point
    5  1006 2013-01-07    BEIJING     130-F   32  4432.0  female   Y       40
    4  1005 2013-01-06         sh     210-A   34     NaN    male   N       40
    3  1004 2013-01-05    shanghai    110-C   32  5433.0  female   Y       40
    2  1003 2013-01-04   guangzhou    110-A   54  2133.0    male   Y       20
    1  1002 2013-01-03          SH    100-B   44     NaN  female   N       12
    0  1001 2013-01-02    Beijing     100-A   23  1200.0    male   Y       10
    ==================================结果分割线==================================
    
    Process finished with exit code 0
    
    1. 按照某列数据排序, case when 语句
    println(df_inner.sort_values(by=['age'],ascending=False)) # == select * from df_inner order by age desc;
    
    df_inner['group'] = np.where(df_inner['price'] > 3000,'high','low') # == select *,(case price when price>3000 then 'high' else 'low' end)group from df_inner
    println(df_inner)
    
    D:Softwarepython37.32python.exe C:/Users/wang/Desktop/pan.py
         id       date        city category  age   price  gender pay  m-point
    2  1003 2013-01-04   guangzhou    110-A   54  2133.0    male   Y       20
    1  1002 2013-01-03          SH    100-B   44     NaN  female   N       12
    4  1005 2013-01-06         sh     210-A   34     NaN    male   N       40
    3  1004 2013-01-05    shanghai    110-C   32  5433.0  female   Y       40
    5  1006 2013-01-07    BEIJING     130-F   32  4432.0  female   Y       40
    0  1001 2013-01-02    Beijing     100-A   23  1200.0    male   Y       10
    ==================================结果分割线==================================
         id       date        city category  age   price  gender pay  m-point group
    0  1001 2013-01-02    Beijing     100-A   23  1200.0    male   Y       10   low
    1  1002 2013-01-03          SH    100-B   44     NaN  female   N       12   low
    2  1003 2013-01-04   guangzhou    110-A   54  2133.0    male   Y       20   low
    3  1004 2013-01-05    shanghai    110-C   32  5433.0  female   Y       40  high
    4  1005 2013-01-06         sh     210-A   34     NaN    male   N       40   low
    5  1006 2013-01-07    BEIJING     130-F   32  4432.0  female   Y       40  high
    ==================================结果分割线==================================
    
    Process finished with exit code 0
    
    
    1. 新增一列
    df_inner.loc[(df_inner['city'] == 'guangzhou') | (df_inner['price'] >= 4000), 'sign']=1 # loc
    println(df_inner)
    
    D:Softwarepython37.32python.exe C:/Users/wang/Desktop/pan.py
         id       date        city category  age   price  gender pay  m-point  sign
    0  1001 2013-01-02    Beijing     100-A   23  1200.0    male   Y       10   NaN
    1  1002 2013-01-03          SH    100-B   44     NaN  female   N       12   NaN
    2  1003 2013-01-04   guangzhou    110-A   54  2133.0    male   Y       20   NaN
    3  1004 2013-01-05    shanghai    110-C   32  5433.0  female   Y       40   1.0
    4  1005 2013-01-06         sh     210-A   34     NaN    male   N       40   NaN
    5  1006 2013-01-07    BEIJING     130-F   32  4432.0  female   Y       40   1.0
    ==================================结果分割线==================================
    
    Process finished with exit code 0
    
    
    1. 数据分列
    # 数据分列
    new_split = pd.DataFrame((x.split('-') for x in df_inner['category']),index=df_inner.index,columns=['category','size'])
    print(new_split)
    
    D:Softwarepython37.32python.exe C:/Users/wang/Desktop/pan.py
      category size
    0      100    A
    1      100    B
    2      110    A
    3      110    C
    4      210    A
    5      130    F
    
    Process finished with exit code 0
    
    
    1. 根据索引和区域取值
    # 获取某一行的数据:根据索引
    println(df_inner.loc[3])
    
    # 获取索引区域的值
    println(df_inner.iloc[0:5])
    
    # 获取前三行,前两列的值,此处的数字不是索引哦,而是数据的位置
    println(df_inner.iloc[:3,:2]) #冒号前后的数字不再是索引的标签名称,而是数据所在的位置,从0开始,前三行,前两列。
    
    # 按照位置
    println(df_inner.iloc[[0,2,5],[4,5]]) #提取第0、2、5行,4、5列)
    
    D:Softwarepython37.32python.exe C:/Users/wang/Desktop/pan.py
    id                         1004
    date        2013-01-05 00:00:00
    city                   shanghai
    category                  110-C
    age                          32
    price                      5433
    gender                   female
    pay                           Y
    m-point                      40
    Name: 3, dtype: object
    ==================================结果分割线==================================
         id       date        city category  age   price  gender pay  m-point
    0  1001 2013-01-02    Beijing     100-A   23  1200.0    male   Y       10
    1  1002 2013-01-03          SH    100-B   44     NaN  female   N       12
    2  1003 2013-01-04   guangzhou    110-A   54  2133.0    male   Y       20
    3  1004 2013-01-05    shanghai    110-C   32  5433.0  female   Y       40
    4  1005 2013-01-06         sh     210-A   34     NaN    male   N       40
    ==================================结果分割线==================================
         id       date
    0  1001 2013-01-02
    1  1002 2013-01-03
    2  1003 2013-01-04
    ==================================结果分割线==================================
       age   price
    0   23  1200.0
    2   54  2133.0
    5   32  4432.0
    ==================================结果分割线==================================
    
    
    1. 设置索引
    df_inner=df_inner.set_index('date')
    println(df_inner)
    
    D:Softwarepython37.32python.exe C:/Users/wang/Desktop/pan.py
                  id        city category  age   price  gender pay  m-point
    date                                                                   
    2013-01-02  1001    Beijing     100-A   23  1200.0    male   Y       10
    2013-01-03  1002          SH    100-B   44     NaN  female   N       12
    2013-01-04  1003   guangzhou    110-A   54  2133.0    male   Y       20
    2013-01-05  1004    shanghai    110-C   32  5433.0  female   Y       40
    2013-01-06  1005         sh     210-A   34     NaN    male   N       40
    2013-01-07  1006    BEIJING     130-F   32  4432.0  female   Y       40
    ==================================结果分割线==================================
    
    Process finished with exit code 0
    
    
    1. 判断是否包含某个数据
    # 判断某列是否包含某个数据
    println(df_inner['city'].isin(['Beijing ']))
    
    # 判断city列里是否包含beijing和shanghai,然后将符合条件的数据提取出来
    println(df_inner.loc[df_inner['city'].isin(['Beijing ','shanghai'])])
    
    D:Softwarepython37.32python.exe C:/Users/wang/Desktop/pan.py
    0     True
    1    False
    2    False
    3    False
    4    False
    5    False
    Name: city, dtype: bool
    ==================================结果分割线==================================
         id       date      city category  age   price  gender pay  m-point
    0  1001 2013-01-02  Beijing     100-A   23  1200.0    male   Y       10
    3  1004 2013-01-05  shanghai    110-C   32  5433.0  female   Y       40
    ==================================结果分割线==================================
    
    Process finished with exit code 0
    
    
    1. 截取某列的前几个字符
    println(pd.DataFrame(df_inner['category'].str[:3]))
    
    D:Softwarepython37.32python.exe C:/Users/wang/Desktop/pan.py
      category
    0      100
    1      100
    2      110
    3      110
    4      210
    5      130
    ==================================结果分割线==================================
    
    Process finished with exit code 0
    
    
    1. 表关联
    # selet 'id','city','age','category','gender' from df_inner where age>10 and city=Beijing
    println(df_inner.loc[(df_inner['age'] > 10) & (df_inner['city'] == 'Beijing '), ['id','city','age','category','gender']])
    
    # select 'id','city','age','category','gender' from df_inner where age>25 or city=beijing order by age desc;
    println(df_inner.loc[(df_inner['age'] > 25) | (df_inner['city'] == 'beijing'), ['id','city','age','category','gender']].sort_values(['age'],ascending=False))
    
    # select count(*) from df_inner where city != beijing ;
    println(df_inner.loc[(df_inner['city'] != 'beijing')].city.count())
    
    # select * from df_inner where city = Beijing or city = shanghai;
    println(df_inner.query('city == ["Beijing ", "shanghai "]')) # 或的关系
    
    # select sum(price) from df_inner where city = Beijing or city = shanghai;
    println(df_inner.query('city == ["Beijing ", "shanghai "]').price.sum())
    
    D:Softwarepython37.32python.exe C:/Users/wang/Desktop/pan.py
         id      city  age category gender
    0  1001  Beijing    23    100-A   male
    ==================================结果分割线==================================
         id        city  age category  gender
    2  1003   guangzhou   54    110-A    male
    1  1002          SH   44    100-B  female
    4  1005         sh    34    210-A    male
    3  1004    shanghai   32    110-C  female
    5  1006    BEIJING    32    130-F  female
    ==================================结果分割线==================================
    6
    ==================================结果分割线==================================
         id       date      city category  age   price gender pay  m-point
    0  1001 2013-01-02  Beijing     100-A   23  1200.0   male   Y       10
    ==================================结果分割线==================================
    1200.0
    ==================================结果分割线==================================
    
    
    1. 分组,统计
    println(df_inner.groupby('city').count())
    println(df_inner.groupby('city')['id'].count())
    println(df_inner.groupby(['city','age'])['id'].count())
    println(df_inner.groupby('city')['price'].agg([len,np.sum, np.mean]))
    
    D:Softwarepython37.32python.exe C:/Users/wang/Desktop/pan.py
                id  date  category  age  price  gender  pay  m-point
    city                                                            
     guangzhou   1     1         1    1      1       1    1        1
    BEIJING      1     1         1    1      1       1    1        1
    Beijing      1     1         1    1      1       1    1        1
    SH           1     1         1    1      0       1    1        1
    sh           1     1         1    1      0       1    1        1
    shanghai     1     1         1    1      1       1    1        1
    ==================================结果分割线==================================
    city
     guangzhou    1
    BEIJING       1
    Beijing       1
    SH            1
    sh            1
    shanghai      1
    Name: id, dtype: int64
    ==================================结果分割线==================================
    city        age
     guangzhou  54     1
    BEIJING     32     1
    Beijing     23     1
    SH          44     1
    sh          34     1
    shanghai    32     1
    Name: id, dtype: int64
    ==================================结果分割线==================================
                len     sum    mean
    city                           
     guangzhou  1.0  2133.0  2133.0
    BEIJING     1.0  4432.0  4432.0
    Beijing     1.0  1200.0  1200.0
    SH          1.0     0.0     NaN
    sh          1.0     0.0     NaN
    shanghai    1.0  5433.0  5433.0
    ==================================结果分割线==================================
    
    Process finished with exit code 0
    
    
    1. 写入excel,csv
    df_inner.to_excel('excel_to_python.xlsx', sheet_name='bluewhale_cc')
    df_inner.to_csv('csv_to_python.csv')
    
    1. 读取excel,csv
    df = pd.read_csv('excel_to_python.csv')
    df2 = pd.read_excel('excel_to_python.xlsx')
    

    代码:

    import numpy as np
    import pandas as pd
    
    def println(*args):
        print(*args,end='
    ==================================结果分割线==================================
    ')
    
    
    df = pd.DataFrame(
        {
            "id":[1001,1002,1003,1004,1005,1006],
            "date":pd.date_range('20130102', periods=6),
            "city":['Beijing ', 'SH', ' guangzhou', 'shanghai', 'sh ', 'BEIJING '],
            "age":[23,44,54,32,34,32],
            "category":['100-A','100-B','110-A','110-C','210-A','130-F'],
            "price":[1200,np.nan,2133,5433,np.nan,4432]
        },
        columns =['id','date','city','category','age','price'])
    
    println(df,type(df))
    println(df.shape) # (col,row)
    println(df.info())
    println(df.dtypes)
    println(df['id'].dtype)
    println(df.isnull())
    println(df['price'].isnull())
    
    println(df['id'].unique())
    println(df.values)
    println(df.columns)
    println(df.head()) #默认前10行数据
    println(df.tail()) #默认后10行数据
    
    df.fillna(value=0,inplace=True)
    println(df)
    
    println(df['price'].mean())
    
    df['city']=df['city'].map(str.strip)
    println(df)
    
    df['city'] = df['city'].str.lower()
    println(df['city'])
    df['price'].fillna(value = 0,inplace=True)
    println(df['price'].astype('int'))
    
    df.rename(columns={'category': 'category-size'},inplace=True)
    println(df)
    
    df['city'].drop_duplicates()
    df['city'].drop_duplicates(keep='last')
    
    println(df['city'].replace('sh ', 'shanghai'))
    
    
    df1=pd.DataFrame(
        {
            "id":[1001,1002,1003,1004,1005,1006,1007,1008],
            "gender":['male','female','male','female','male','female','male','female'],
            "pay":['Y','N','Y','Y','N','Y','N','Y',],
            "m-point":[10,12,20,40,40,40,30,20]
        })
    
    df_inner = pd.merge(df,df1,how='inner')  # 匹配合并,交集
    println(df_inner)
    
    df_left=pd.merge(df,df1,how='left')
    println(df_left)
    
    df_right=pd.merge(df,df1,how='right')
    println(df_right)
    
    df_outer=pd.merge(df,df1,how='outer')  #并集
    println(df_outer)
    
    df_inner.set_index('id')
    println(df_inner.sort_index(ascending=False))
    
    println(df_inner.sort_values(by=['age'],ascending=False))
    
    df_inner['group'] = np.where(df_inner['price'] > 3000,'high','low')
    println(df_inner)
    
    df_inner.loc[(df_inner['city'] == 'guangzhou') | (df_inner['price'] >= 4000), 'sign']=1
    println(df_inner)
    
    # 数据分列
    new_split = pd.DataFrame((x.split('-') for x in df_inner['category']),index=df_inner.index,columns=['category','size'])
    print(new_split)
    
    # 获取某一行的数据:根据索引
    println(df_inner.loc[3])
    
    # 获取索引区域的值
    println(df_inner.iloc[0:5])
    
    # 获取前三行,前两列的值,此处的数字不是索引哦,而是数据的位置
    println(df_inner.iloc[:3,:2]) #冒号前后的数字不再是索引的标签名称,而是数据所在的位置,从0开始,前三行,前两列。
    
    # 按照位置
    println(df_inner.iloc[[0,2,5],[4,5]]) #提取第0、2、5行,4、5列)
    
    df_inner=df_inner.set_index('date')
    println(df_inner)
    
    # 判断某列是否包含某个数据
    println(df_inner['city'].isin(['Beijing ']))
    
    # 判断city列里是否包含beijing和shanghai,然后将符合条件的数据提取出来
    println(df_inner.loc[df_inner['city'].isin(['Beijing ','shanghai'])])
    
    println(pd.DataFrame(df_inner['category'].str[:3]))
    
    # selet 'id','city','age','category','gender' from df_inner where age>10 and city=Beijing
    println(df_inner.loc[(df_inner['age'] > 10) & (df_inner['city'] == 'Beijing '), ['id','city','age','category','gender']])
    
    # select 'id','city','age','category','gender' from df_inner where age>25 or city=beijing order by age desc;
    println(df_inner.loc[(df_inner['age'] > 25) | (df_inner['city'] == 'beijing'), ['id','city','age','category','gender']].sort_values(['age'],ascending=False))
    
    # select count(*) from df_inner where city != beijing ;
    println(df_inner.loc[(df_inner['city'] != 'beijing')].city.count())
    
    # select * from df_inner where city = Beijing or city = shanghai;
    println(df_inner.query('city == ["Beijing ", "shanghai "]')) # 或的关系
    
    # select sum(price) from df_inner where city = Beijing or city = shanghai;
    println(df_inner.query('city == ["Beijing ", "shanghai "]').price.sum())
    
    # 分组,统计
    println(df_inner.groupby('city').count())
    println(df_inner.groupby('city')['id'].count())
    println(df_inner.groupby(['city','age'])['id'].count())
    println(df_inner.groupby('city')['price'].agg([len,np.sum, np.mean]))
    
    df_inner.to_excel('excel_to_python.xlsx', sheet_name='bluewhale_cc')
    df_inner.to_csv('excel_to_python.csv')
    
    df = pd.read_csv('excel_to_python.csv')
    df2 = pd.read_excel('excel_to_python.xlsx')
    
  • 相关阅读:
    分库分表的技术演进
    Mybatis传递多个参数
    tomcat实现热部署的配置
    名为'<Cursor Name>'的游标已存在
    ABP示例程序-使用AngularJs,ASP.NET MVC,Web API和EntityFramework创建N层的单页面Web应用
    ABP官方文档翻译 10.1 ABP Nuget包
    ABP官方文档翻译 9.3 NHibernate集成
    ABP官方文档翻译 9.2 Entity Framework Core
    ABP官方文档翻译 9.1 EntityFramework集成
    ABP官方文档翻译 8.2 SignalR集成
  • 原文地址:https://www.cnblogs.com/wztshine/p/13959885.html
Copyright © 2011-2022 走看看