zoukankan      html  css  js  c++  java
  • 数据清洗合并和转化

    数据清洗

    • 数据清洗是数据分析关键的一步,直接影响之后的处理工作

    • 数据需要修改吗?有什么需要修改的吗?数据应该怎么调整才能适用于接下来的分析和挖掘?

    • 是一个迭代的过程,实际项目中可能需要不止一次地执行这些清洗操作

    • 处理缺失数据:pd.fillna(),pd.dropna()

    数据连接(pd.merge)

    • pd.merge

    • 根据单个或多个键将不同DataFrame的行连接起来

    • 类似数据库的连接操作

    示例代码:

    import pandas as pd
    import numpy as np
    
    df_obj1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                            'data1' : np.random.randint(0,10,7)})
    df_obj2 = pd.DataFrame({'key': ['a', 'b', 'd'],
                            'data2' : np.random.randint(0,10,3)})
    
    print(df_obj1)
    print(df_obj2)
    

    运行结果:

       data1 key
       data1 key
    0      8   b
    1      8   b
    2      3   a
    3      5   c
    4      4   a
    5      9   a
    6      6   b
    
       data2 key
    0      9   a
    1      0   b
    2      3   d
    

    1. 默认将重叠列的列名作为“外键”进行连接

    示例代码:

    # 默认将重叠列的列名作为“外键”进行连接
    print(pd.merge(df_obj1, df_obj2))
    

    运行结果:

       data1 key  data2
    0      8   b      0
    1      8   b      0
    2      6   b      0
    3      3   a      9
    4      4   a      9
    5      9   a      9
    

    2. on显示指定“外键”

    示例代码:

    # on显示指定“外键”
    print(pd.merge(df_obj1, df_obj2, on='key'))
    

    运行结果:

       data1 key  data2
    0      8   b      0
    1      8   b      0
    2      6   b      0
    3      3   a      9
    4      4   a      9
    5      9   a      9
    

    3. left_on,左侧数据的“外键”,right_on,右侧数据的“外键”

    示例代码:

    # left_on,right_on分别指定左侧数据和右侧数据的“外键”
    
    # 更改列名
    df_obj1 = df_obj1.rename(columns={'key':'key1'})
    df_obj2 = df_obj2.rename(columns={'key':'key2'})
    
    print(pd.merge(df_obj1, df_obj2, left_on='key1', right_on='key2'))
    

    运行结果:

       data1 key1  data2 key2
    0      8    b      0    b
    1      8    b      0    b
    2      6    b      0    b
    3      3    a      9    a
    4      4    a      9    a
    5      9    a      9    a
    

    默认是“内连接”(inner),即结果中的键是交集

    how指定连接方式

    4. “外连接”(outer),结果中的键是并集

    示例代码:

    # “外连接”
    print(pd.merge(df_obj1, df_obj2, left_on='key1', right_on='key2', how='outer'))
    

    运行结果:

       data1 key1  data2 key2
    0    8.0    b    0.0    b
    1    8.0    b    0.0    b
    2    6.0    b    0.0    b
    3    3.0    a    9.0    a
    4    4.0    a    9.0    a
    5    9.0    a    9.0    a
    6    5.0    c    NaN  NaN
    7    NaN  NaN    3.0    d
    

    5. “左连接”(left)

    示例代码:

    # 左连接
    print(pd.merge(df_obj1, df_obj2, left_on='key1', right_on='key2', how='left'))
    

    运行结果:

       data1 key1  data2 key2
    0      8    b    0.0    b
    1      8    b    0.0    b
    2      3    a    9.0    a
    3      5    c    NaN  NaN
    4      4    a    9.0    a
    5      9    a    9.0    a
    6      6    b    0.0    b
    

    6. “右连接”(right)

    示例代码:

    # 右连接
    print(pd.merge(df_obj1, df_obj2, left_on='key1', right_on='key2', how='right'))
    

    运行结果:

       data1 key1  data2 key2
    0    8.0    b      0    b
    1    8.0    b      0    b
    2    6.0    b      0    b
    3    3.0    a      9    a
    4    4.0    a      9    a
    5    9.0    a      9    a
    6    NaN  NaN      3    d
    

    7. 处理重复列名

    suffixes,默认为_x, _y

    示例代码:

    
    # 处理重复列名
    df_obj1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                            'data' : np.random.randint(0,10,7)})
    df_obj2 = pd.DataFrame({'key': ['a', 'b', 'd'],
                            'data' : np.random.randint(0,10,3)})
    
    print(pd.merge(df_obj1, df_obj2, on='key', suffixes=('_left', '_right')))
    

    运行结果:

       data_left key  data_right
    0          9   b           1
    1          5   b           1
    2          1   b           1
    3          2   a           8
    4          2   a           8
    5          5   a           8
    

    8. 按索引连接

    left_index=True或right_index=True

    示例代码:

    
    # 按索引连接
    df_obj1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                            'data1' : np.random.randint(0,10,7)})
    df_obj2 = pd.DataFrame({'data2' : np.random.randint(0,10,3)}, index=['a', 'b', 'd'])
    
    print(pd.merge(df_obj1, df_obj2, left_on='key', right_index=True))
    

    运行结果:

       data1 key  data2
    0      3   b      6
    1      4   b      6
    6      8   b      6
    2      6   a      0
    4      3   a      0
    5      0   a      0
    

    数据合并(pd.concat)

    • 沿轴方向将多个对象合并到一起

    1. NumPy的concat

    np.concatenate

    示例代码:

    import numpy as np
    import pandas as pd
    
    arr1 = np.random.randint(0, 10, (3, 4))
    arr2 = np.random.randint(0, 10, (3, 4))
    
    print(arr1)
    print(arr2)
    
    print(np.concatenate([arr1, arr2]))
    print(np.concatenate([arr1, arr2], axis=1))
    

    运行结果:

    # print(arr1)
    [[3 3 0 8]
     [2 0 3 1]
     [4 8 8 2]]
    
    # print(arr2)
    [[6 8 7 3]
     [1 6 8 7]
     [1 4 7 1]]
    
    # print(np.concatenate([arr1, arr2]))
     [[3 3 0 8]
     [2 0 3 1]
     [4 8 8 2]
     [6 8 7 3]
     [1 6 8 7]
     [1 4 7 1]]
    
    # print(np.concatenate([arr1, arr2], axis=1)) 
    [[3 3 0 8 6 8 7 3]
     [2 0 3 1 1 6 8 7]
     [4 8 8 2 1 4 7 1]]
    

    2. pd.concat

    • 注意指定轴方向,默认axis=0

    • join指定合并方式,默认为outer

    • Series合并时查看行索引有无重复

    1) index 没有重复的情况

    示例代码:

    # index 没有重复的情况
    ser_obj1 = pd.Series(np.random.randint(0, 10, 5), index=range(0,5))
    ser_obj2 = pd.Series(np.random.randint(0, 10, 4), index=range(5,9))
    ser_obj3 = pd.Series(np.random.randint(0, 10, 3), index=range(9,12))
    
    print(ser_obj1)
    print(ser_obj2)
    print(ser_obj3)
    
    print(pd.concat([ser_obj1, ser_obj2, ser_obj3]))
    print(pd.concat([ser_obj1, ser_obj2, ser_obj3], axis=1))
    

    运行结果:

    # print(ser_obj1)
    0    1
    1    8
    2    4
    3    9
    4    4
    dtype: int64
    
    # print(ser_obj2)
    5    2
    6    6
    7    4
    8    2
    dtype: int64
    
    # print(ser_obj3)
    9     6
    10    2
    11    7
    dtype: int64
    
    # print(pd.concat([ser_obj1, ser_obj2, ser_obj3]))
    0     1
    1     8
    2     4
    3     9
    4     4
    5     2
    6     6
    7     4
    8     2
    9     6
    10    2
    11    7
    dtype: int64
    
    # print(pd.concat([ser_obj1, ser_obj2, ser_obj3], axis=1))
          0    1    2
    0   1.0  NaN  NaN
    1   5.0  NaN  NaN
    2   3.0  NaN  NaN
    3   2.0  NaN  NaN
    4   4.0  NaN  NaN
    5   NaN  9.0  NaN
    6   NaN  8.0  NaN
    7   NaN  3.0  NaN
    8   NaN  6.0  NaN
    9   NaN  NaN  2.0
    10  NaN  NaN  3.0
    11  NaN  NaN  3.0
    
    2) index 有重复的情况

    示例代码:

    # index 有重复的情况
    ser_obj1 = pd.Series(np.random.randint(0, 10, 5), index=range(5))
    ser_obj2 = pd.Series(np.random.randint(0, 10, 4), index=range(4))
    ser_obj3 = pd.Series(np.random.randint(0, 10, 3), index=range(3))
    
    print(ser_obj1)
    print(ser_obj2)
    print(ser_obj3)
    
    print(pd.concat([ser_obj1, ser_obj2, ser_obj3]))
    

    运行结果:

    # print(ser_obj1)
    0    0
    1    3
    2    7
    3    2
    4    5
    dtype: int64
    
    # print(ser_obj2)
    0    5
    1    1
    2    9
    3    9
    dtype: int64
    
    # print(ser_obj3)
    0    8
    1    7
    2    9
    dtype: int64
    
    # print(pd.concat([ser_obj1, ser_obj2, ser_obj3]))
    0    0
    1    3
    2    7
    3    2
    4    5
    0    5
    1    1
    2    9
    3    9
    0    8
    1    7
    2    9
    dtype: int64
    
    # print(pd.concat([ser_obj1, ser_obj2, ser_obj3], axis=1, join='inner')) 
    # join='inner' 将去除NaN所在的行或列
       0  1  2
    0  0  5  8
    1  3  1  7
    2  7  9  9
    
    3) DataFrame合并时同时查看行索引和列索引有无重复

    示例代码:

    df_obj1 = pd.DataFrame(np.random.randint(0, 10, (3, 2)), index=['a', 'b', 'c'],
                           columns=['A', 'B'])
    df_obj2 = pd.DataFrame(np.random.randint(0, 10, (2, 2)), index=['a', 'b'],
                           columns=['C', 'D'])
    print(df_obj1)
    print(df_obj2)
    
    print(pd.concat([df_obj1, df_obj2]))
    print(pd.concat([df_obj1, df_obj2], axis=1, join='inner'))
    

    运行结果:

    # print(df_obj1)
       A  B
    a  3  3
    b  5  4
    c  8  6
    
    # print(df_obj2)
       C  D
    a  1  9
    b  6  8
    
    # print(pd.concat([df_obj1, df_obj2]))
         A    B    C    D
    a  3.0  3.0  NaN  NaN
    b  5.0  4.0  NaN  NaN
    c  8.0  6.0  NaN  NaN
    a  NaN  NaN  1.0  9.0
    b  NaN  NaN  6.0  8.0
    
    # print(pd.concat([df_obj1, df_obj2], axis=1, join='inner'))
       A  B  C  D
    a  3  3  1  9
    b  5  4  6  8
    

    数据重构

    1. stack

    • 将列索引旋转为行索引,完成层级索引

    • DataFrame->Series

    示例代码:

    import numpy as np
    import pandas as pd
    
    df_obj = pd.DataFrame(np.random.randint(0,10, (5,2)), columns=['data1', 'data2'])
    print(df_obj)
    
    stacked = df_obj.stack()
    print(stacked)
    

    运行结果:

    # print(df_obj)
       data1  data2
    0      7      9
    1      7      8
    2      8      9
    3      4      1
    4      1      2
    
    # print(stacked)
    0  data1    7
       data2    9
    1  data1    7
       data2    8
    2  data1    8
       data2    9
    3  data1    4
       data2    1
    4  data1    1
       data2    2
    dtype: int64
    

    2. unstack

    • 将层级索引展开

    • Series->DataFrame

    • 认操作内层索引,即level=-1

    示例代码:

    # 默认操作内层索引
    print(stacked.unstack())
    
    # 通过level指定操作索引的级别
    print(stacked.unstack(level=0))
    

    运行结果:

    # print(stacked.unstack())
       data1  data2
    0      7      9
    1      7      8
    2      8      9
    3      4      1
    4      1      2
    
    # print(stacked.unstack(level=0))
           0  1  2  3  4
    data1  7  7  8  4  1
    data2  9  8  9  1  2
    

    数据转换

    一、 处理重复数据

    duplicated() 返回布尔型Series表示每行是否为重复行

    示例代码:

    import numpy as np
    import pandas as pd
    
    df_obj = pd.DataFrame({'data1' : ['a'] * 4 + ['b'] * 4,
                           'data2' : np.random.randint(0, 4, 8)})
    print(df_obj)
    
    print(df_obj.duplicated())
    

    运行结果:

    # print(df_obj)
      data1  data2
    0     a      3
    1     a      2
    2     a      3
    3     a      3
    4     b      1
    5     b      0
    6     b      3
    7     b      0
    
    # print(df_obj.duplicated())
    0    False
    1    False
    2     True
    3     True
    4    False
    5    False
    6    False
    7     True
    dtype: bool
    

    drop_duplicates() 过滤重复行

    默认判断全部列

    可指定按某些列判断

    示例代码:

    print(df_obj.drop_duplicates())
    print(df_obj.drop_duplicates('data2'))
    

    运行结果:

    # print(df_obj.drop_duplicates())
      data1  data2
    0     a      3
    1     a      2
    4     b      1
    5     b      0
    6     b      3
    
    # print(df_obj.drop_duplicates('data2'))
      data1  data2
    0     a      3
    1     a      2
    4     b      1
    5     b      0
    

    3. 根据map传入的函数对每行或每列进行转换

    • Series根据map传入的函数对每行或每列进行转换

    示例代码:

    ser_obj = pd.Series(np.random.randint(0,10,10))
    print(ser_obj)
    
    print(ser_obj.map(lambda x : x ** 2))
    

    运行结果:

    # print(ser_obj)
    0    1
    1    4
    2    8
    3    6
    4    8
    5    6
    6    6
    7    4
    8    7
    9    3
    dtype: int64
    
    # print(ser_obj.map(lambda x : x ** 2))
    0     1
    1    16
    2    64
    3    36
    4    64
    5    36
    6    36
    7    16
    8    49
    9     9
    dtype: int64
    

    二、数据替换

    replace根据值的内容进行替换

    示例代码:

    # 单个值替换单个值
    print(ser_obj.replace(1, -100))
    
    # 多个值替换一个值
    print(ser_obj.replace([6, 8], -100))
    
    # 多个值替换多个值
    print(ser_obj.replace([4, 7], [-100, -200]))
    

    运行结果:

    # print(ser_obj.replace(1, -100))
    0   -100
    1      4
    2      8
    3      6
    4      8
    5      6
    6      6
    7      4
    8      7
    9      3
    dtype: int64
    
    # print(ser_obj.replace([6, 8], -100))
    0      1
    1      4
    2   -100
    3   -100
    4   -100
    5   -100
    6   -100
    7      4
    8      7
    9      3
    dtype: int64
    
    # print(ser_obj.replace([4, 7], [-100, -200]))
    0      1
    1   -100
    2      8
    3      6
    4      8
    5      6
    6      6
    7   -100
    8   -200
    9      3
    dtype: int64
  • 相关阅读:
    hihoCoder #1176 : 欧拉路·一 (简单)
    228 Summary Ranges 汇总区间
    227 Basic Calculator II 基本计算器II
    226 Invert Binary Tree 翻转二叉树
    225 Implement Stack using Queues 队列实现栈
    224 Basic Calculator 基本计算器
    223 Rectangle Area 矩形面积
    222 Count Complete Tree Nodes 完全二叉树的节点个数
    221 Maximal Square 最大正方形
    220 Contains Duplicate III 存在重复 III
  • 原文地址:https://www.cnblogs.com/alexzhang92/p/9794037.html
Copyright © 2011-2022 走看看