  • 4 - Pandas Data Preprocessing: Data Transformation (df.map(), df.replace())

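      In data analysis, some data often needs to be transformed to fit the task at hand. In Pandas, the common ways to transform data are:

    • Using a function or a mapping
    • Applying a user-defined function, or one provided by another package, to a Pandas object to modify values in bulk
    • The applymap and map instance methods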
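
    The options listed above can be sketched on a toy frame. This is only an illustration: the frame df and the helper name elementwise are made up here. The lookup is needed because pandas 2.1 and later expose element-wise application on a DataFrame as DataFrame.map and deprecate applymap.

```python
import pandas as pd

# Toy data, not the HR dataset used in this post.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Series.map: apply a function to every element of one column.
doubled = df["a"].map(lambda x: x * 2)

# Element-wise application over the whole DataFrame.
# pandas >= 2.1 calls this DataFrame.map; older releases only have applymap.
elementwise = getattr(df, "map", None) or df.applymap
scaled = elementwise(lambda x: x * 10)

print(doubled.tolist())
print(scaled)
```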

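      This section uses survey data on the employees of a company as an example:

    number_project: the number of projects the employee works on

    left: whether the employee has left the company

    salary: salary level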

    >>> import pandas as pd
    >>> import numpy as np
    >>> data  = pd.read_csv('./input/HR.csv',encoding = 'gbk')
    >>> data = data[['number_project','left','salary']]
    >>> data.head()
       number_project  left  salary
    0               2     1     low
    1               5     1  medium
    2               7     1  medium
    3               5     1     low
    4               2     1     low
    

     I. map() and replace()

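    (1) Using a function. Example: capitalize the first letter of each word in the salary column. The result is assigned back to the column so that later outputs show the capitalized values.

    >>> data['salary'] = data['salary'].map(str.title)
    >>> data['salary'][:5]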
    0       Low
    1    Medium
    2    Medium
    3       Low
    4       Low
    Name: salary, dtype: object
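
    map() is not limited to built-ins such as str.title. The following is a minimal sketch with a user-defined function; the function name grade_label and the sample values are invented for illustration. Passing na_action="ignore" leaves missing values untouched instead of handing them to the function.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing value; object dtype keeps NaN as-is.
salary = pd.Series(["low", "medium", np.nan, "high"], dtype=object)

def grade_label(level):
    """Turn a salary level such as 'low' into a label such as 'Low pay'."""
    return f"{level.title()} pay"

# Without na_action="ignore" the function would receive NaN (a float)
# and fail on .title(); with it, NaN is passed through unchanged.
labels = salary.map(grade_label, na_action="ignore")
print(labels)
```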
    

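    (2) Using a dict that defines the mapping. Example: for left, build an indicator variable in which 'YES' stands for left=1 and 'NO' stands for left=0. (In real preprocessing, strings are usually encoded as 0, 1, ..., n; the reverse direction is used here only because it is easier to follow.)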

    >>> mapper = {0:'NO',1:'YES'}
    >>> data['left'] = data['left'].map(mapper)
    >>> data.head()
       number_project left  salary
    0               2  YES     Low
    1               5  YES  Medium
    2               7  YES  Medium
    3               5  YES     Low
    4               2  YES     Low
    

    Note that when map() is given a dict, the dict must cover every value that occurs. Any value without an entry becomes NaN, as in this example:

    >>> s = pd.Series(['A','B','C'])
    >>> s
    0    A
    1    B
    2    C
    dtype: object
    >>> s.map({'A':10,'B':100})
    0     10.0
    1    100.0
    2      NaN
    dtype: float64
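
    When an incomplete mapping is intentional, there are two common ways to avoid the NaN. The sketch below reuses the Series from the example above; the default value -1 and the variable names are arbitrary choices made for illustration.

```python
import pandas as pd

s = pd.Series(["A", "B", "C"], dtype=object)
mapping = {"A": 10, "B": 100}

# Plain dict lookup: 'C' has no entry, so it becomes NaN.
mapped = s.map(mapping)

# Option 1: fill the gaps with a default value afterwards.
with_default = s.map(mapping).fillna(-1).astype(int)

# Option 2: fall back to the original value when there is no entry.
kept = s.map(lambda v: mapping.get(v, v))

print(with_default.tolist())
print(kept.tolist())
```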

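    (3) Renaming an index. The map method can also be applied to the Index object that holds the row labels or the column names (both are Index objects). It returns a new Index; assign the result back to data.columns to actually rename the columns.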

    >>> data.columns
    Index(['number_project', 'left', 'salary'], dtype='object')
    >>> data.columns.map(str.upper)
    Index(['NUMBER_PROJECT', 'LEFT', 'SALARY'], dtype='object')
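
    Index.map only builds a new Index, so the frame keeps its old column names until the result is stored. A small sketch of the ways to make the rename stick, using a two-row stand-in for the HR data (the values are made up):

```python
import pandas as pd

df = pd.DataFrame(
    {"number_project": [2, 5], "left": [1, 1], "salary": ["low", "medium"]}
)

# Returns a new Index; df itself is not modified.
upper = df.columns.map(str.upper)

# rename() accepts a function and returns a renamed copy.
renamed = df.rename(columns=str.upper)

# Assigning back changes the column names of df itself.
df.columns = df.columns.map(str.title)

print(list(upper))
print(list(df.columns))
```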

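    (4) Using a mapping with replace(). When values need to be replaced according to a fixed correspondence, use replace(). To replace many values, pass two lists of equal length (old values, new values); for a handful of values, a dict holding the mapping is enough.

    Example: set the number_project values 2, 3, 4 to Less and 5, 6, 7 to More.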

    >>> data['number_project'] = data['number_project'].replace([2,3,4,5,6,7],['Less','Less','Less','More','More','More'])
    >>> data.head()
      number_project left  salary
    0           Less  YES     Low
    1           More  YES  Medium
    2           More  YES  Medium
    3           More  YES     Low
    4           Less  YES     Low
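
    The same replacement can be written with a dict instead of two parallel lists, which keeps each old value next to its new value. A minimal sketch, assuming a small Series named projects in place of the real column:

```python
import pandas as pd

# Stand-in for data['number_project'].
projects = pd.Series([2, 5, 7, 5, 2, 3])

# Build the old -> new mapping once.
bucket = {n: "Less" for n in (2, 3, 4)}
bucket.update({n: "More" for n in (5, 6, 7)})

# replace() leaves any value that is not a key of the dict unchanged,
# unlike map(), which would turn it into NaN.
labels = projects.replace(bucket)
print(labels.tolist())
```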
    

       Now consider a dataset test_loan:

    	user	term	int_rate	grade	loan_status
    389	8	36 months	13.66%	C	Fully Paid
    417	9	36 months	11.99%	B	Charged Off
    705	6	60 months	15.59%	D	Fully Paid
    921	7	60 months	11.44%	B	Fully Paid
    1138	4	36 months	13.66%	C	Fully Paid
    1251	5	36 months	13.66%	C	Charged Off

    1) A loan whose loan_status is "Charged Off" carries default risk and is treated as a bad loan: mark it as 1 and every other loan as 0. replace() does the value substitution:

    test_loan['loan_status']=test_loan['loan_status'].replace(["Charged Off","Fully Paid"],[1,0])
    	user	term	int_rate	grade	loan_status
    389	8	36 months	13.66%	C	0
    417	9	36 months	11.99%	B	1
    705	6	60 months	15.59%	D	0
    921	7	60 months	11.44%	B	0
    1138	4	36 months	13.66%	C	0
    1251	5	36 months	13.66%	C	1
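
    For a two-valued flag like this one, a boolean comparison is a common alternative to replace(). The sketch below rebuilds test_loan from the rows shown above, since the post does not say how the frame was loaded. The variable name bad_loan is an illustrative choice. On newer pandas releases, replacing strings with numbers through replace() may emit a downcasting warning, which the comparison avoids.

```python
import pandas as pd

# test_loan reconstructed from the table in this post.
test_loan = pd.DataFrame(
    {
        "user": [8, 9, 6, 7, 4, 5],
        "term": ["36 months", "36 months", "60 months",
                 "60 months", "36 months", "36 months"],
        "int_rate": ["13.66%", "11.99%", "15.59%", "11.44%", "13.66%", "13.66%"],
        "grade": ["C", "B", "D", "B", "C", "C"],
        "loan_status": ["Fully Paid", "Charged Off", "Fully Paid",
                        "Fully Paid", "Fully Paid", "Charged Off"],
    },
    index=[389, 417, 705, 921, 1138, 1251],
)

# True where the loan was charged off, then cast True/False to 1/0.
bad_loan = (test_loan["loan_status"] == "Charged Off").astype(int)
print(bad_loan.tolist())
```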
    

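    2) replace() can also replace different values in different columns with the same new value in a single call. The output below still shows the text labels in loan_status, so this example is written against test_loan as loaded, before the 0/1 conversion in 1):

    test_loan.replace(to_replace={'loan_status':'Fully Paid','grade':'B'},value='Good')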
    
    	user	term	int_rate	grade	loan_status
    389	8	36 months	13.66%	C	Good
    417	9	36 months	11.99%	Good	Charged Off
    705	6	60 months	15.59%	D	Good
    921	7	60 months	11.44%	Good	Good
    1138	4	36 months	13.66%	C	Good
    1251	5	36 months	13.66%	C	Charged Off
    

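    Here to_replace is the value to look for and value is the new value that replaces it. replace() is a general-purpose substitution method: it accepts scalars, lists, dicts and regular expressions.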
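
    When each column needs its own new value, replace() also accepts a nested dict of the form {column: {old: new}}, in which case value is left out. A minimal sketch on a two-column excerpt of test_loan; the labels 'Good' and 'Bad' are arbitrary.

```python
import pandas as pd

# Two columns of test_loan, copied from the table in this post.
test_loan = pd.DataFrame(
    {
        "grade": ["C", "B", "D", "B", "C", "C"],
        "loan_status": ["Fully Paid", "Charged Off", "Fully Paid",
                        "Fully Paid", "Fully Paid", "Charged Off"],
    },
    index=[389, 417, 705, 921, 1138, 1251],
)

# Outer keys are column names, inner dicts map old values to new ones.
relabelled = test_loan.replace(
    {"grade": {"B": "Good"}, "loan_status": {"Charged Off": "Bad"}}
)

# replace() returns a new DataFrame; test_loan is left untouched.
print(relabelled)
```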

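     3) Regular expressions can be used as well: set regex=True so that to_replace is interpreted as a regular expression.

      Example: replace every value that starts with D with Bad. The pattern is anchored with ^ so that it only matches at the start of the string.

    test_loan.replace(to_replace='^D.*$',value='Bad',regex=True)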
    
    	user	term	int_rate	grade	loan_status
    389	8	36 months	13.66%	C	Fully Paid
    417	9	36 months	11.99%	B	Charged Off
    705	6	60 months	15.59%	Bad	Fully Paid
    921	7	60 months	11.44%	B	Fully Paid
    1138	4	36 months	13.66%	C	Fully Paid
    1251	5	36 months	13.66%	C	Charged Off
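
    Two details of regex replacement are easy to miss: only the matched part of a string is substituted, and an unanchored pattern can match in the middle of a value. The sketch below shows both, and then uses the same mechanism to strip the units from term and int_rate so the columns can be converted to numbers. The sample values 'AD' and the variable names are made up for illustration.

```python
import pandas as pd

grades = pd.Series(["D", "AD", "B"], dtype=object)

# Anchored: only values that start with D are affected.
anchored = grades.replace(r"^D.*$", "Bad", regex=True)

# Unanchored: the D inside 'AD' matches too, and only that part is replaced.
unanchored = grades.replace(r"D+.*$", "Bad", regex=True)

# Strip the units, then convert the remaining text to numbers.
loans = pd.DataFrame(
    {"term": ["36 months", "60 months"], "int_rate": ["13.66%", "15.59%"]}
)
term_num = loans["term"].replace(r"\s*months$", "", regex=True).astype(int)
rate_num = loans["int_rate"].replace(r"%$", "", regex=True).astype(float)

print(anchored.tolist(), unanchored.tolist())
print(term_num.tolist(), rate_num.tolist())
```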
    

      

      

  • Original article: https://www.cnblogs.com/Cheryol/p/13415877.html