zoukankan      html  css  js  c++  java
  • 数据清洗合并和转化

    数据清洗

    • 数据清洗是数据分析关键的一步,直接影响之后的处理工作

    • 数据需要修改吗?有什么需要修改的吗?数据应该怎么调整才能适用于接下来的分析和挖掘?

    • 是一个迭代的过程,实际项目中可能需要不止一次地执行这些清洗操作

    • 处理缺失数据:pd.fillna(),pd.dropna()

    数据连接(pd.merge)

    • pd.merge

    • 根据单个或多个键将不同DataFrame的行连接起来

    • 类似数据库的连接操作

    示例代码:

    import pandas as pd
    import numpy as np
    
    df_obj1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                            'data1' : np.random.randint(0,10,7)})
    df_obj2 = pd.DataFrame({'key': ['a', 'b', 'd'],
                            'data2' : np.random.randint(0,10,3)})
    
    print(df_obj1)
    print(df_obj2)
    

    运行结果:

       data1 key
       data1 key
    0      8   b
    1      8   b
    2      3   a
    3      5   c
    4      4   a
    5      9   a
    6      6   b
    
       data2 key
    0      9   a
    1      0   b
    2      3   d
    

    1. 默认将重叠列的列名作为“外键”进行连接

    示例代码:

    # 默认将重叠列的列名作为“外键”进行连接
    print(pd.merge(df_obj1, df_obj2))
    

    运行结果:

       data1 key  data2
    0      8   b      0
    1      8   b      0
    2      6   b      0
    3      3   a      9
    4      4   a      9
    5      9   a      9
    

    2. on显示指定“外键”

    示例代码:

    # on显示指定“外键”
    print(pd.merge(df_obj1, df_obj2, on='key'))
    

    运行结果:

       data1 key  data2
    0      8   b      0
    1      8   b      0
    2      6   b      0
    3      3   a      9
    4      4   a      9
    5      9   a      9
    

    3. left_on,左侧数据的“外键”,right_on,右侧数据的“外键”

    示例代码:

    # left_on,right_on分别指定左侧数据和右侧数据的“外键”
    
    # 更改列名
    df_obj1 = df_obj1.rename(columns={'key':'key1'})
    df_obj2 = df_obj2.rename(columns={'key':'key2'})
    
    print(pd.merge(df_obj1, df_obj2, left_on='key1', right_on='key2'))
    

    运行结果:

       data1 key1  data2 key2
    0      8    b      0    b
    1      8    b      0    b
    2      6    b      0    b
    3      3    a      9    a
    4      4    a      9    a
    5      9    a      9    a
    

    默认是“内连接”(inner),即结果中的键是交集

    how指定连接方式

    4. “外连接”(outer),结果中的键是并集

    示例代码:

    # “外连接”
    print(pd.merge(df_obj1, df_obj2, left_on='key1', right_on='key2', how='outer'))
    

    运行结果:

       data1 key1  data2 key2
    0    8.0    b    0.0    b
    1    8.0    b    0.0    b
    2    6.0    b    0.0    b
    3    3.0    a    9.0    a
    4    4.0    a    9.0    a
    5    9.0    a    9.0    a
    6    5.0    c    NaN  NaN
    7    NaN  NaN    3.0    d
    

    5. “左连接”(left)

    示例代码:

    # 左连接
    print(pd.merge(df_obj1, df_obj2, left_on='key1', right_on='key2', how='left'))
    

    运行结果:

       data1 key1  data2 key2
    0      8    b    0.0    b
    1      8    b    0.0    b
    2      3    a    9.0    a
    3      5    c    NaN  NaN
    4      4    a    9.0    a
    5      9    a    9.0    a
    6      6    b    0.0    b
    

    6. “右连接”(right)

    示例代码:

    # 右连接
    print(pd.merge(df_obj1, df_obj2, left_on='key1', right_on='key2', how='right'))
    

    运行结果:

       data1 key1  data2 key2
    0    8.0    b      0    b
    1    8.0    b      0    b
    2    6.0    b      0    b
    3    3.0    a      9    a
    4    4.0    a      9    a
    5    9.0    a      9    a
    6    NaN  NaN      3    d
    

    7. 处理重复列名

    suffixes,默认为_x, _y

    示例代码:

    
    # 处理重复列名
    df_obj1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                            'data' : np.random.randint(0,10,7)})
    df_obj2 = pd.DataFrame({'key': ['a', 'b', 'd'],
                            'data' : np.random.randint(0,10,3)})
    
    print(pd.merge(df_obj1, df_obj2, on='key', suffixes=('_left', '_right')))
    

    运行结果:

       data_left key  data_right
    0          9   b           1
    1          5   b           1
    2          1   b           1
    3          2   a           8
    4          2   a           8
    5          5   a           8
    

    8. 按索引连接

    left_index=True或right_index=True

    示例代码:

    
    # 按索引连接
    df_obj1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                            'data1' : np.random.randint(0,10,7)})
    df_obj2 = pd.DataFrame({'data2' : np.random.randint(0,10,3)}, index=['a', 'b', 'd'])
    
    print(pd.merge(df_obj1, df_obj2, left_on='key', right_index=True))
    

    运行结果:

       data1 key  data2
    0      3   b      6
    1      4   b      6
    6      8   b      6
    2      6   a      0
    4      3   a      0
    5      0   a      0
    

    数据合并(pd.concat)

    • 沿轴方向将多个对象合并到一起

    1. NumPy的concat

    np.concatenate

    示例代码:

    import numpy as np
    import pandas as pd
    
    arr1 = np.random.randint(0, 10, (3, 4))
    arr2 = np.random.randint(0, 10, (3, 4))
    
    print(arr1)
    print(arr2)
    
    print(np.concatenate([arr1, arr2]))
    print(np.concatenate([arr1, arr2], axis=1))
    

    运行结果:

    # print(arr1)
    [[3 3 0 8]
     [2 0 3 1]
     [4 8 8 2]]
    
    # print(arr2)
    [[6 8 7 3]
     [1 6 8 7]
     [1 4 7 1]]
    
    # print(np.concatenate([arr1, arr2]))
     [[3 3 0 8]
     [2 0 3 1]
     [4 8 8 2]
     [6 8 7 3]
     [1 6 8 7]
     [1 4 7 1]]
    
    # print(np.concatenate([arr1, arr2], axis=1)) 
    [[3 3 0 8 6 8 7 3]
     [2 0 3 1 1 6 8 7]
     [4 8 8 2 1 4 7 1]]
    

    2. pd.concat

    • 注意指定轴方向,默认axis=0

    • join指定合并方式,默认为outer

    • Series合并时查看行索引有无重复

    1) index 没有重复的情况

    示例代码:

    # index 没有重复的情况
    ser_obj1 = pd.Series(np.random.randint(0, 10, 5), index=range(0,5))
    ser_obj2 = pd.Series(np.random.randint(0, 10, 4), index=range(5,9))
    ser_obj3 = pd.Series(np.random.randint(0, 10, 3), index=range(9,12))
    
    print(ser_obj1)
    print(ser_obj2)
    print(ser_obj3)
    
    print(pd.concat([ser_obj1, ser_obj2, ser_obj3]))
    print(pd.concat([ser_obj1, ser_obj2, ser_obj3], axis=1))
    

    运行结果:

    # print(ser_obj1)
    0    1
    1    8
    2    4
    3    9
    4    4
    dtype: int64
    
    # print(ser_obj2)
    5    2
    6    6
    7    4
    8    2
    dtype: int64
    
    # print(ser_obj3)
    9     6
    10    2
    11    7
    dtype: int64
    
    # print(pd.concat([ser_obj1, ser_obj2, ser_obj3]))
    0     1
    1     8
    2     4
    3     9
    4     4
    5     2
    6     6
    7     4
    8     2
    9     6
    10    2
    11    7
    dtype: int64
    
    # print(pd.concat([ser_obj1, ser_obj2, ser_obj3], axis=1))
          0    1    2
    0   1.0  NaN  NaN
    1   5.0  NaN  NaN
    2   3.0  NaN  NaN
    3   2.0  NaN  NaN
    4   4.0  NaN  NaN
    5   NaN  9.0  NaN
    6   NaN  8.0  NaN
    7   NaN  3.0  NaN
    8   NaN  6.0  NaN
    9   NaN  NaN  2.0
    10  NaN  NaN  3.0
    11  NaN  NaN  3.0
    
    2) index 有重复的情况

    示例代码:

    # index 有重复的情况
    ser_obj1 = pd.Series(np.random.randint(0, 10, 5), index=range(5))
    ser_obj2 = pd.Series(np.random.randint(0, 10, 4), index=range(4))
    ser_obj3 = pd.Series(np.random.randint(0, 10, 3), index=range(3))
    
    print(ser_obj1)
    print(ser_obj2)
    print(ser_obj3)
    
    print(pd.concat([ser_obj1, ser_obj2, ser_obj3]))
    

    运行结果:

    # print(ser_obj1)
    0    0
    1    3
    2    7
    3    2
    4    5
    dtype: int64
    
    # print(ser_obj2)
    0    5
    1    1
    2    9
    3    9
    dtype: int64
    
    # print(ser_obj3)
    0    8
    1    7
    2    9
    dtype: int64
    
    # print(pd.concat([ser_obj1, ser_obj2, ser_obj3]))
    0    0
    1    3
    2    7
    3    2
    4    5
    0    5
    1    1
    2    9
    3    9
    0    8
    1    7
    2    9
    dtype: int64
    
    # print(pd.concat([ser_obj1, ser_obj2, ser_obj3], axis=1, join='inner')) 
    # join='inner' 将去除NaN所在的行或列
       0  1  2
    0  0  5  8
    1  3  1  7
    2  7  9  9
    
    3) DataFrame合并时同时查看行索引和列索引有无重复

    示例代码:

    df_obj1 = pd.DataFrame(np.random.randint(0, 10, (3, 2)), index=['a', 'b', 'c'],
                           columns=['A', 'B'])
    df_obj2 = pd.DataFrame(np.random.randint(0, 10, (2, 2)), index=['a', 'b'],
                           columns=['C', 'D'])
    print(df_obj1)
    print(df_obj2)
    
    print(pd.concat([df_obj1, df_obj2]))
    print(pd.concat([df_obj1, df_obj2], axis=1, join='inner'))
    

    运行结果:

    # print(df_obj1)
       A  B
    a  3  3
    b  5  4
    c  8  6
    
    # print(df_obj2)
       C  D
    a  1  9
    b  6  8
    
    # print(pd.concat([df_obj1, df_obj2]))
         A    B    C    D
    a  3.0  3.0  NaN  NaN
    b  5.0  4.0  NaN  NaN
    c  8.0  6.0  NaN  NaN
    a  NaN  NaN  1.0  9.0
    b  NaN  NaN  6.0  8.0
    
    # print(pd.concat([df_obj1, df_obj2], axis=1, join='inner'))
       A  B  C  D
    a  3  3  1  9
    b  5  4  6  8
    

    数据重构

    1. stack

    • 将列索引旋转为行索引,完成层级索引

    • DataFrame->Series

    示例代码:

    import numpy as np
    import pandas as pd
    
    df_obj = pd.DataFrame(np.random.randint(0,10, (5,2)), columns=['data1', 'data2'])
    print(df_obj)
    
    stacked = df_obj.stack()
    print(stacked)
    

    运行结果:

    # print(df_obj)
       data1  data2
    0      7      9
    1      7      8
    2      8      9
    3      4      1
    4      1      2
    
    # print(stacked)
    0  data1    7
       data2    9
    1  data1    7
       data2    8
    2  data1    8
       data2    9
    3  data1    4
       data2    1
    4  data1    1
       data2    2
    dtype: int64
    

    2. unstack

    • 将层级索引展开

    • Series->DataFrame

    • 认操作内层索引,即level=-1

    示例代码:

    # 默认操作内层索引
    print(stacked.unstack())
    
    # 通过level指定操作索引的级别
    print(stacked.unstack(level=0))
    

    运行结果:

    # print(stacked.unstack())
       data1  data2
    0      7      9
    1      7      8
    2      8      9
    3      4      1
    4      1      2
    
    # print(stacked.unstack(level=0))
           0  1  2  3  4
    data1  7  7  8  4  1
    data2  9  8  9  1  2
    

    数据转换

    一、 处理重复数据

    duplicated() 返回布尔型Series表示每行是否为重复行

    示例代码:

    import numpy as np
    import pandas as pd
    
    df_obj = pd.DataFrame({'data1' : ['a'] * 4 + ['b'] * 4,
                           'data2' : np.random.randint(0, 4, 8)})
    print(df_obj)
    
    print(df_obj.duplicated())
    

    运行结果:

    # print(df_obj)
      data1  data2
    0     a      3
    1     a      2
    2     a      3
    3     a      3
    4     b      1
    5     b      0
    6     b      3
    7     b      0
    
    # print(df_obj.duplicated())
    0    False
    1    False
    2     True
    3     True
    4    False
    5    False
    6    False
    7     True
    dtype: bool
    

    drop_duplicates() 过滤重复行

    默认判断全部列

    可指定按某些列判断

    示例代码:

    print(df_obj.drop_duplicates())
    print(df_obj.drop_duplicates('data2'))
    

    运行结果:

    # print(df_obj.drop_duplicates())
      data1  data2
    0     a      3
    1     a      2
    4     b      1
    5     b      0
    6     b      3
    
    # print(df_obj.drop_duplicates('data2'))
      data1  data2
    0     a      3
    1     a      2
    4     b      1
    5     b      0
    

    3. 根据map传入的函数对每行或每列进行转换

    • Series根据map传入的函数对每行或每列进行转换

    示例代码:

    ser_obj = pd.Series(np.random.randint(0,10,10))
    print(ser_obj)
    
    print(ser_obj.map(lambda x : x ** 2))
    

    运行结果:

    # print(ser_obj)
    0    1
    1    4
    2    8
    3    6
    4    8
    5    6
    6    6
    7    4
    8    7
    9    3
    dtype: int64
    
    # print(ser_obj.map(lambda x : x ** 2))
    0     1
    1    16
    2    64
    3    36
    4    64
    5    36
    6    36
    7    16
    8    49
    9     9
    dtype: int64
    

    二、数据替换

    replace根据值的内容进行替换

    示例代码:

    # 单个值替换单个值
    print(ser_obj.replace(1, -100))
    
    # 多个值替换一个值
    print(ser_obj.replace([6, 8], -100))
    
    # 多个值替换多个值
    print(ser_obj.replace([4, 7], [-100, -200]))
    

    运行结果:

    # print(ser_obj.replace(1, -100))
    0   -100
    1      4
    2      8
    3      6
    4      8
    5      6
    6      6
    7      4
    8      7
    9      3
    dtype: int64
    
    # print(ser_obj.replace([6, 8], -100))
    0      1
    1      4
    2   -100
    3   -100
    4   -100
    5   -100
    6   -100
    7      4
    8      7
    9      3
    dtype: int64
    
    # print(ser_obj.replace([4, 7], [-100, -200]))
    0      1
    1   -100
    2      8
    3      6
    4      8
    5      6
    6      6
    7   -100
    8   -200
    9      3
    dtype: int64
  • 相关阅读:
    continue用法
    break用法
    VLAN的划分
    子网掩码的计算
    简述RIP路由协议和OSPF路由协议的相同点和不同点。
    工程监理的内容是什么?
    工程监理的意义和职责是什么?
    双绞线测试的参数主要有哪些?
    光纤熔接损耗原因?
    综合布线系统的设计等级有哪几种?各有什么特点?
  • 原文地址:https://www.cnblogs.com/alexzhang92/p/9794037.html
Copyright © 2011-2022 走看看