zoukankan      html  css  js  c++  java
  • Pandas系列(二)- DataFrame数据框

    一、初识DataFrame

      dataFrame 是一个带有索引的二维数据结构,每列可以有自己的名字,并且可以有不同的数据类型。你可以把它想象成一个 excel 表格或者数据库中的一张表DataFrame是最常用的 Pandas 对象。

    二、数据框的创建

      1.字典套列表方式创建

    index = pd.Index(data=["Tom", "Bob", "Mary", "James"], name="name")
    data = {
        "age": [18, 30, 25, 40],
        "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen"]
    }
    user_info = pd.DataFrame(data=data, index=index)
    user_info
    Out[35]: 
           age       city
    name                 
    Tom     18    BeiJing
    Bob     30   ShangHai
    Mary    25  GuangZhou
    James   40   ShenZhen
    

      2. 列表套字典方式创建

    data = [{'name':'xiaohong','age':23,'tel':10086},{'name':'xiaogang','age':12},{'name':'xiaozhang','tel':10010}]
    user_info = pd.DataFrame(data=data)
    user_info
    Out[36]: 
        age       name      tel
    0  23.0   xiaohong  10086.0
    1  12.0   xiaogang      NaN
    2   NaN  xiaozhang  10010.0
    

      3.数组方式创建

    data = [[18, "BeiJing"], 
            [30, "ShangHai"], 
            [25, "GuangZhou"], 
            [40, "ShenZhen"]]
    columns = ["age", "city"]
    user_info = pd.DataFrame(data=data, index=index, columns=columns)
    user_info
    Out[37]: 
           age       city
    name                 
    Tom     18    BeiJing
    Bob     30   ShangHai
    Mary    25  GuangZhou
    James   40   ShenZhen
    

      4.from_dict方式

    result = {'name': 'zhangyafei','age': 24, 'city':'shanxi','weather':'sunny','date':'2019-3-11'}
    data = pd.DataFrame.from_dict(result,orient='index').T
    data
    Out[44]: 
             name age    city weather       date
    0  zhangyafei  24  shanxi   sunny  2019-3-11 

    二、数据框的增删改查

      数据准备

    index = pd.Index(data=["Tom", "Bob", "Mary", "James"], name="name")
    data = {
        "age": [18, 30, 25, 40],
        "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen"]
    }
    df = pd.DataFrame(data=data, index=index)
    df
    Out[45]: 
           age       city
    name                 
    Tom     18    BeiJing
    Bob     30   ShangHai
    Mary    25  GuangZhou
    James   40   ShenZhen
    1. 增加

    #增加行 注意:这种方法,效率非常低,不应该用于遍历中
    df.loc[len(df)]=[23,'shanxi']
    #增加列
    df['sex'] = [1,1,1,0]
    df.assign(age_add_one = df.age + 1)
    df.loc[len(df)] = [23, 'shanxi']
    df
    Out[47]: 
           age       city
    name                 
    Tom     18    BeiJing
    Bob     30   ShangHai
    Mary    25  GuangZhou
    James   40   ShenZhen
    4       23     shanxi
    df['sex'] = [1,1,1,0,1]
    df
    Out[49]: 
           age       city  sex
    name                      
    Tom     18    BeiJing    1
    Bob     30   ShangHai    1
    Mary    25  GuangZhou    1
    James   40   ShenZhen    0
    4       23     shanxi    1
    df.assign(age_add_one = user_info["age"] + 1)
    Out[79]: 
           age       city  age_add_one
    name                              
    Tom     18    BeiJing           19
    Bob     30   ShangHai           31
    Mary    25  GuangZhou           26
    James   40   ShenZhen           41
    增加行列  

       2.  删

    #根据行索引剔除
    df = df.drop(4,axis=0,inplace=True)  # inplace可选
    #根据列名剔除
    df.drop('sex',axis=1,inplace=True)
    df.pop('sex')   # 有返回值
    #第二种剔除列的方法
    del df['age2']  
    df.drop('sex', axis=1, inplace=True)
    df
    Out[65]: 
           age       city
    name                 
    Tom     18    BeiJing
    Bob     30   ShangHai
    Mary    25  GuangZhou
    James   40   ShenZhen
          23     shanxi
    df.drop(4,axis=0, inplace=True)
    df
    Out[67]: 
           age       city
    name                 
    Tom     18    BeiJing
    Bob     30   ShangHai
    Mary    25  GuangZhou
    James   40   ShenZhen
    df['sex'] = [1,1,0,1]
    df
    Out[71]: 
           age       city  sex
    name                      
    Tom     18    BeiJing    1
    Bob     30   ShangHai    1
    Mary    25  GuangZhou    0
    James   40   ShenZhen    1
    del df['sex']
    df
    Out[73]: 
           age       city
    name                 
    Tom     18    BeiJing
    Bob     30   ShangHai
    Mary    25  GuangZhou
    James   40   ShenZhen
    
    df.pop('sex')
    Out[77]: 
    name
    Tom      1
    Bob      1
    Mary     0
    James    1
    Name: sex, dtype: int64
    删除行列

      3. 修改

    # 方式一:
    df.columns = ['Age','City','Sex']
    df.index = ['tom','bob']
    # 方式二   推荐使用方式二
    df.rename(columns={"age": "Age", "city": "City", "sex": "Sex"})
    df.rename(index={"Tom": "tom", "Bob": "bob"})
    

      4. 查

    # 访问行
    df[0:4:2]   # 按行序号访问
    df.loc['Tom',]   # 按行索引访问
    df.iloc[:1,]   # 按行序号访问
    
    # 访问列
    df['age']   # 列名访问,多列用数组
    df.loc[:,'age']
    df.iloc[:, 0:1]
    
    # 访问行列
    df.loc['Tom','age]   # 按行索引访问
    df.iloc[:1,:1]   # 按行序号访问
    
    
    # 根据条件逻辑值取值
    df[df.age>=30] 
    df[1:2]
    Out[82]: 
          age      city
    name               
    Bob    30  ShangHai
    df[0:4:2]
    Out[83]: 
          age       city
    name                
    Tom    18    BeiJing
    Mary   25  GuangZhou
    
    df.loc['Tom',]
    Out[88]: 
    age          18
    city    BeiJing
    Name: Tom, dtype: object
    df
    Out[89]: 
           age       city
    name                 
    Tom     18    BeiJing
    Bob     30   ShangHai
    Mary    25  GuangZhou
    James   40   ShenZhen
    df.iloc[:1,]
    Out[90]: 
          age     city
    name              
    Tom    18  BeiJing
    df['age']
    Out[91]: 
    name
    Tom      18
    Bob      30
    Mary     25
    James    40
    Name: age, dtype: int64
    df[['age','city']]
    Out[92]: 
           age       city
    name                 
    Tom     18    BeiJing
    Bob     30   ShangHai
    Mary    25  GuangZhou
    James   40   ShenZhen
    
    Out[94]: 
    name
    Tom      18
    Bob      30
    Mary     25
    James    40
    Name: age, dtype: int64
    df.iloc[:, 1]
    Out[95]: 
    name
    Tom        BeiJing
    Bob       ShangHai
    Mary     GuangZhou
    James     ShenZhen
    Name: city, dtype: object
    df.iloc[:, 0]
    Out[96]: 
    name
    Tom      18
    Bob      30
    Mary     25
    James    40
    Name: age, dtype: int64
    df.iloc[:, 0:1]
    Out[97]: 
           age
    name      
    Tom     18
    Bob     30
    Mary    25
    James   40
    df.loc['Tom', 'age']
    Out[98]: 18
    df.iloc[:1, :1]
    Out[99]: 
          age
    name     
    Tom    18
    df
    Out[100]: 
           age       city
    name                 
    Tom     18    BeiJing
    Bob     30   ShangHai
    Mary    25  GuangZhou
    James   40   ShenZhen
    df[df.age >= 30]
    Out[101]: 
           age      city
    name                
    Bob     30  ShangHai
    James   40  ShenZhen
    
    df = pd.DataFrame({'BoolCol': [1, 2, 3, 3, 4],'attr': [22, 33, 22, 44, 66]},
    index=[10,20,30,40,50])
    print(df)
    value= df[(df.BoolCol==3)&(df.attr==22)].values.tolist()[0]
    type(value)
    print(" ".join(str(id) for id in value))
    index = df[(df.BoolCol==3)&(df.attr==22)].index.tolist()
    print(index)
        BoolCol  attr
    10        1    22
    20        2    33
    30        3    22
    40        3    44
    50        4    66
    3 22
    [30]
    查找

    三、数据框的遍历

    #遍历列名 
    for r in df:
      print(r)
    #遍历列
    for cName in df:
      print('df的列:
    ',cName)
      print('df的值:
    ',df[cName])
      print("-"*10)
    
    遍历行
    
    第一种:apply方式 推荐
    
    def new_data(row):
      """增加别名列"""
      drug_name = row['药品名称']
      try:  
      row['别名'] = drug_name.rsplit('(',1)[1].strip(')')
      row['药品名称'] = drug_name.rsplit('(',1)[0]
      except IndexError as e:
      row['别名'] = np.NAN
      return row
    
     new_drug = data.apply(new_data,axis=1)
     
    
    第二种:dataframe.iterrows
    for key, row in data.iterrows():
      drug_name = row['药品名称'].values
      drug_alias = drug_name.rsplit('(',1)[1].strip(')')
      print(drug_name)
      print(drug_alias)
    
    第三种:index方式
    resoved_drug_list = []
    for row in data.index:
      drug_name = '{}[{}]'.format(data.iloc[row]['药品名称'],data.iloc[row]['药品ID'])
      resoved_drug_list.append(drug_name)
    
    第四种:values方式
    for r in df.values:
      print(r)
      print(r[0])
      print(r[1])
      print('-'*10)
    
    第五种:while遍历DataFrame
    df = DataFrame({
    'age':Series([21,22,23]),
    'name':Series(['zhang','liu','kang']) 
    })
    
    rowCount = len(df)
    i = 0
    while i<rowCount:  
      print(df.iloc[i])
      i+=1
    
    补充:
    
    #遍历字符串
    for letter in 'python':
    print('现在是:',letter)
    
    #遍历数组
    fruits = ['banana','apple','mango']
    for fruit in fruits:
    print('现在是:',fruit)
    #遍历序列
    x = Series(['a',True,1],index=['first','second','third'])
    x[0]
    x['second']
    x[2]
    
    for v in x:
        print('x中的值:',v)
    for index in x.index:
        print('X中的索引:',index)
        print('x中的值:',x[index])
        print('*'*10)

    四、函数应用

      虽说 Pandas 为我们提供了非常丰富的函数,有时候我们可能需要自己定制一些函数,并将它应用到 DataFrame 或 Series。常用到的函数有:map、apply、applymap。

    • map 是 Series 中特有的方法,通过它可以对 Series 中的每个元素实现转换。如果我想通过年龄判断用户是否属于中年人(30岁以上为中年),通过 map 可以轻松搞定它。我想要通过城市来判断是南方还是北方,我可以这样操作。
    • apply 方法既支持 Series,也支持 DataFrame,在对 Series 操作时会作用到每个值上,在对 DataFrame 操作时会作用到所有行或所有列(通过 axis 参数控制)。# 对 Series 来说,apply 方法 与 map 方法区别不大。
    • applymap 方法针对于 DataFrame,它作用于 DataFrame 中的每个元素,它对 DataFrame 的效果类似于 apply 对 Series 的效果

      1. 数据

    index = pd.Index(data=["Tom", "Bob", "Mary", "James"], name="name")
    data = {
        "age": [18, 30, 25, 40],
        "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen"],
        "sex": ["male", "male", "female", "male"]
    }
    user_info = pd.DataFrame(data=data, index=index)
    user_info
    Out[117]: 
           age       city     sex
    name                         
    Tom     18    BeiJing    male
    Bob     30   ShangHai    male
    Mary    25  GuangZhou  female
    James   40   ShenZhen    male

      2. 函数应用实例

    # 通过年龄判断用户是否属于中年人(30岁以上为中年)
    user_info.age.map(lambda x: "yes" if x >= 30 else "no")
    Out[118]: 
    name
    Tom       no
    Bob      yes
    Mary      no
    James    yes
    Name: age, dtype: object
    # 通过城市来判断是南方还是北方
    city_map = {
        "BeiJing": "north",
        "ShangHai": "south",
        "GuangZhou": "south",
        "ShenZhen": "south"
    }
    user_info.city.map(city_map)
    Out[119]: 
    name
    Tom      north
    Bob      south
    Mary     south
    James    south
    Name: city, dtype: object
    # 求每一列的最大值
    user_info.apply(lambda x: x.max(), axis=0)
    Out[120]: 
    age           40
    city    ShenZhen
    sex         male
    dtype: object
    
    # 将每个值转化为小写字符串
    user_info.applymap(lambda x: str(x).lower())
    Out[121]: 
          age       city     sex
    name                        
    Tom    18    beijing    male
    Bob    30   shanghai    male
    Mary   25  guangzhou  female
    James  40   shenzhen    male

    六、dataframe的属性

    index = pd.Index(data=["Tom", "Bob", "Mary", "James"], name="name")
    data = {
        "age": [18, 30, 25, 40],
        "city": ["BeiJing", "ShangHai", "GuangZhou", "ShenZhen"],
        "sex": ["male", "male", "female", "male"]
    }
    user_info = pd.DataFrame(data=data, index=index)
    user_info
    Out[122]: 
           age       city     sex
    name                         
    Tom     18    BeiJing    male
    Bob     30   ShangHai    male
    Mary    25  GuangZhou  female
    James   40   ShenZhen    male
    

        1. 基本属性

    user_info.shape  # 查看形状
    user_info.columns # 查看列名
    user_info.dtype  # 查看列数据类型
    user_info.ndim  # 查看数据维度
    user_info.T     # 转置  df.transpose() 相同
    user_info.values   
    user_info.index  
    user_info.shape
    Out[127]: (4, 3)
    user_info.T
    Out[128]: 
    name      Tom       Bob       Mary     James
    age        18        30         25        40
    city  BeiJing  ShangHai  GuangZhou  ShenZhen
    sex      male      male     female      male
    user_info.values
    Out[129]: 
    array([[18, 'BeiJing', 'male'],
           [30, 'ShangHai', 'male'],
           [25, 'GuangZhou', 'female'],
           [40, 'ShenZhen', 'male']], dtype=object)
    user_info.index
    Out[130]: Index(['Tom', 'Bob', 'Mary', 'James'], dtype='object', name='name')
    user_info.ndim
    Out[158]: 2
    user_info.columns
    Out[159]: Index(['age', 'city', 'sex', 'height'], dtype='object')
    user_info.dtypes
    Out[160]: 
    age        int64
    city      object
    sex       object
    height    object
    dtype: object
    基本属性

      2. 基础方法

    user_info.info() #查看整体情况
    user_info.head() #默认查看前5行
    user_info.tail()  # 默认查看后5行
    user_info.select_dtypes(include=['float64']).columns   # 选择特定类型的列 
    user_info.info()
    <class 'pandas.core.frame.DataFrame'>
    Index: 4 entries, Tom to James
    Data columns (total 3 columns):
    age     4 non-null int64
    city    4 non-null object
    sex     4 non-null object
    dtypes: int64(1), object(2)
    memory usage: 288.0+ bytes
    user_info.head()
    Out[125]: 
           age       city     sex
    name                         
    Tom     18    BeiJing    male
    Bob     30   ShangHai    male
    Mary    25  GuangZhou  female
    James   40   ShenZhen    male
    user_info.tail()
    Out[126]: 
           age       city     sex
    name                         
    Tom     18    BeiJing    male
    Bob     30   ShangHai    male
    Mary    25  GuangZhou  female
    James   40   ShenZhen    male
    user_info.select_dtypes(include=['object']).columns
    Out[175]: Index(['city', 'sex', 'height'], dtype='object')
    user_info.select_dtypes(include=['int64']).columns
    Out[179]: Index(['age'], dtype='object')
    基本方法

      3.描述与统计

    user_info.age.sum()
    user_info.age.cumsum() #累加求和
    user_info.describe() #查看数字类型的列整体概况
    user_info.describe(include=['object']) #查看非数字类型的列的整体情况
    user_info.sex.value_counts() #统计某列中每个值出现的次数,相当于分组
    user_info.groupby('sex')['sex'].count()
    user_info.age.idxmax() #获取某列最大值或最小值对应的索引
    user_info.age.idxmin()  
    user_info.age.sum()
    Out[131]: 113
    Out[133]: 
    name
    Tom       18
    Bob       48
    Mary      73
    James    113
    Name: age, dtype: int64
    user_info.describe()
    Out[134]: 
                 age
    count   4.000000
    mean   28.250000
    std     9.251126
    min    18.000000
    25%    23.250000
    50%    27.500000
    75%    32.500000
    max    40.000000
    user_info.sex.value_counts()
    Out[135]: 
    male      3
    female    1
    Name: sex, dtype: int64
    user_info.groupby('sex')['sex'].count()
    Out[136]: 
    sex
    female    1
    male      3
    Name: sex, dtype: int64
    user_info.age.idxmax()
    Out[137]: 'James'
    user_info.age.idxmin()
    Out[138]: 'Tom'
    描述与统计

       4.离散化

    pd.cut(user_info.age,3)
    pd.cut(user_info.age, [1, 18, 30, 50])
    pd.cut(user_info.age, [1, 18, 30, 50], labels=["childhood", "youth", "middle"])
    #cut 进行离散化之外,qcut 也可以实现离散化。cut 是根据每个值的大小来进行离散化的,qcut 是根据每个值出现的次数来进行离散化的。
    pd.qcut(user_info.age, 3)
    pd.cut(user_info.age,3)
    Out[140]: 
    name
    Tom      (17.978, 25.333]
    Bob      (25.333, 32.667]
    Mary     (17.978, 25.333]
    James      (32.667, 40.0]
    Name: age, dtype: category
    Categories (3, interval[float64]): [(17.978, 25.333] < (25.333, 32.667] < (32.667, 40.0]]
    pd.cut(user_info.age, [1, 18, 30, 50])
    Out[141]: 
    name
    Tom       (1, 18]
    Bob      (18, 30]
    Mary     (18, 30]
    James    (30, 50]
    Name: age, dtype: category
    Categories (3, interval[int64]): [(1, 18] < (18, 30] < (30, 50]]
    pd.cut(user_info.age, [1, 18, 30, 50], labels=["childhood", "youth", "middle"])
    Out[142]: 
    name
    Tom      childhood
    Bob          youth
    Mary         youth
    James       middle
    Name: age, dtype: category
    Categories (3, object): [childhood < youth < middle]
    pd.qcut(user_info.age, 3)
    Out[143]: 
    name
    Tom      (17.999, 25.0]
    Bob        (25.0, 30.0]
    Mary     (17.999, 25.0]
    James      (30.0, 40.0]
    Name: age, dtype: category
    Categories (3, interval[float64]): [(17.999, 25.0] < (25.0, 30.0] < (30.0, 40.0]]
    离散化

      5.排序

    #排序功能
    #在进行数据分析时,少不了进行数据排序。Pandas 支持两种排序方式:按轴(索引或列)排序和按实际值排序。
    #sort_index 方法默认是按照索引进行正序排的。
    user_info.sort_index()
    user_info.sort_index(axis=1, ascending=False)#按照列进行倒序排
    #按照实际值来排序
    user_info.sort_values(by="age")
    user_info.sort_values(by=["age", "city"])
    user_info.age.nlargest(2)
    #一般在排序后,我们可能需要获取最大的n个值或最小值的n个值,我们可以使用 nlargest和 
    #nsmallest 方法来完成,这比先进行排序,再使用 head(n) 方法快得多。
    
    user_info
    Out[144]: 
           age       city     sex
    name                         
    Tom     18    BeiJing    male
    Bob     30   ShangHai    male
    Mary    25  GuangZhou  female
    James   40   ShenZhen    male
    user_info.sort_index()
    Out[145]: 
           age       city     sex
    name                         
    Bob     30   ShangHai    male
    James   40   ShenZhen    male
    Mary    25  GuangZhou  female
    Tom     18    BeiJing    male
    user_info.sort_index(axis=1, ascending=False)#按照列进行倒序排
    Out[148]: 
              sex       city  age
    name                         
    Tom      male    BeiJing   18
    Bob      male   ShangHai   30
    Mary   female  GuangZhou   25
    James    male   ShenZhen   40
    user_info.sort_values(by="age")
    Out[149]: 
           age       city     sex
    name                         
    Tom     18    BeiJing    male
    Mary    25  GuangZhou  female
    Bob     30   ShangHai    male
    James   40   ShenZhen    male
    user_info.sort_values(by=["age", "city"])
    Out[150]: 
           age       city     sex
    name                         
    Tom     18    BeiJing    male
    Mary    25  GuangZhou  female
    Bob     30   ShangHai    male
    James   40   ShenZhen    male
    user_info.age.nlargest(2)
    Out[151]: 
    name
    James    40
    Bob      30
    Name: age, dtype: int64
    排序

    七、类型操作

      如果想要获取每种类型的列数的话,可以使用 get_dtype_counts 方法。
      如果想要转换数据类型的话,可以通过 astype 来完成。
      有时候会涉及到将 object 类型转为其他类型,常见的有转为数字、日期、时间差,Pandas 中分别对应 to_numeric、to_datetime、to_timedelta 方法。

      这里给这些用户都添加一些关于身高的信息。现在将身高这一列转为数字,很明显,180cm 并非数字,为了强制转换,我们可以传入 errors 参数,这个参数的作用是当强转失败时的处理方式。默认情况下,errors='raise',这意味着强转失败后直接抛出异常,设置 errors='coerce' 可以在强转失败时将有问题的元素赋值为 pd.NaT(对于datetime和timedelta)或 np.nan(数字)。设置 errors='ignore' 可以在强转失败时返回原有的数据。

    user_info.get_dtype_counts()
    user_info["age"].astype(float)
    user_info["height"] = ["178", "168", "178", "180cm"]
    user_info
    pd.to_numeric(user_info.height, errors="coerce")
    pd.to_numeric(user_info.height, errors="ignore") 
    user_info.get_dtype_counts()
    Out[152]: 
    int64     1
    object    2
    dtype: int64
    user_info["age"].astype(float)
    Out[153]: 
    name
    Tom      18.0
    Bob      30.0
    Mary     25.0
    James    40.0
    Name: age, dtype: float64
    user_info["height"] = ["178", "168", "178", "180cm"]
    user_info
    Out[155]: 
           age       city     sex height
    name                                
    Tom     18    BeiJing    male    178
    Bob     30   ShangHai    male    168
    Mary    25  GuangZhou  female    178
    James   40   ShenZhen    male  180cm
    pd.to_numeric(user_info.height, errors="coerce")
    Out[156]: 
    name
    Tom      178.0
    Bob      168.0
    Mary     178.0
    James      NaN
    Name: height, dtype: float64
    pd.to_numeric(user_info.height, errors="ignore")
    Out[157]: 
    name
    Tom        178
    Bob        168
    Mary       178
    James    180cm
    Name: height, dtype: object
    类型操作
  • 相关阅读:
    OS + Linux + zipTool / tar / tar.gz / zst
    project scm
    product wiki confluence
    script ActionScript / ColdFusion
    链表例题
    链表原理
    链表例题
    链表原理
    链表原理
    链表原理
  • 原文地址:https://www.cnblogs.com/zhangyafei/p/9944853.html
Copyright © 2011-2022 走看看