zoukankan      html  css  js  c++  java
  • python数据分析——处理丢失数据

    处理丢失数据

    有两种丢失数据:

    • None
    • np.nan(NaN)

    1. None

    None是Python自带的,其类型为python object。因此,None不能参与到任何计算中。

    In [1]:
    #查看None的数据类型
    
     

    2. np.nan(NaN)

     

    np.nan是浮点类型,能参与到计算中。但计算的结果总是NaN。

    In [2]:
    #查看np.nan的数据类型

    3. pandas中的None与NaN

    1) pandas中None与np.nan都视作np.nan

    In [1]:
    import numpy as np
    import pandas as pd
    from pandas import Series,DataFrame

    创建DataFrame

    In [2]:
    df = DataFrame(data=np.random.randint(10,50,size=(8,8)))
    df
    
    Out[2]:
     01234567
    0 12 39 29 15 34 17 15 10
    1 20 13 34 20 38 22 34 15
    2 30 45 41 28 26 22 19 27
    3 37 29 47 12 37 49 28 44
    4 30 29 40 34 40 41 20 20
    5 32 49 23 38 25 13 25 40
    6 23 46 31 22 15 37 21 48
    7 41 39 35 26 46 48 33 24
    In [18]:
    #将某些数组元素赋值为nan
    
    In [4]:
    df.iloc[1,3] = None
    df.iloc[2,2] = None
    df.iloc[4,2] = None
    df.iloc[6,7] = np.nan
    
    In [5]:
    df
    
    Out[5]:
     01234567
    0 12 39 29.0 15.0 34 17 15 10.0
    1 20 13 34.0 NaN 38 22 34 15.0
    2 30 45 NaN 28.0 26 22 19 27.0
    3 37 29 47.0 12.0 37 49 28 44.0
    4 30 29 NaN 34.0 40 41 20 20.0
    5 32 49 23.0 38.0 25 13 25 40.0
    6 23 46 31.0 22.0 15 37 21 NaN
    7 41 39 35.0 26.0 46 48 33 24.0
     

    2) pandas处理空值操作

    • isnull()
    • notnull()
    • dropna(): 过滤丢失数据
    • fillna(): 填充丢失数据
    In [8]:
    df.isnull()
    
    Out[8]:
     01234567
    0 False False False False False False False False
    1 False False False True False False False False
    2 False False True False False False False False
    3 False False False False False False False False
    4 False False True False False False False False
    5 False False False False False False False False
    6 False False False False False False False True
    7 False False False False False False False False
    In [14]:
    #创建DataFrame,给其中某些元素赋值为nan
    
    df.notnull().all(axis=1)     #notnull(all)   isnull(any)
    
    Out[14]:
    0     True
    1    False
    2    False
    3     True
    4    False
    5     True
    6    False
    7     True
    dtype: bool
    In [15]:
    df.loc[df.notnull().all(axis=1)]
    
    Out[15]:
     01234567
    0 12 39 29.0 15.0 34 17 15 10.0
    3 37 29 47.0 12.0 37 49 28 44.0
    5 32 49 23.0 38.0 25 13 25 40.0
    7 41 39 35.0 26.0 46 48 33 24.0
     

    (1)判断函数

    • isnull()
    • notnull()
    • df.notnull/isnull().any()/all()
    In [4]:
    #过滤df中的空值(只保留没有空值的行)

    df.dropna() 可以选择过滤的是行还是列(默认为行):axis中0表示行,1表示的列

    In [17]:
    df.dropna(axis=0)
    
    Out[17]:
     01234567
    0 12 39 29.0 15.0 34 17 15 10.0
    3 37 29 47.0 12.0 37 49 28 44.0
    5 32 49 23.0 38.0 25 13 25 40.0
    7 41 39 35.0 26.0 46 48 33 24.0
     

    (3) 填充函数 Series/DataFrame

    • fillna():value和method参数
    In [18]:
    df
    
    Out[18]:
     01234567
    0 12 39 29.0 15.0 34 17 15 10.0
    1 20 13 34.0 NaN 38 22 34 15.0
    2 30 45 NaN 28.0 26 22 19 27.0
    3 37 29 47.0 12.0 37 49 28 44.0
    4 30 29 NaN 34.0 40 41 20 20.0
    5 32 49 23.0 38.0 25 13 25 40.0
    6 23 46 31.0 22.0 15 37 21 NaN
    7 41 39 35.0 26.0 46 48 33 24.0

    可以选择前向填充还是后向填充

    In [21]:
    df.fillna(method='ffill',axis=1)
    
    Out[21]:
     01234567
    0 12.0 39.0 29.0 15.0 34.0 17.0 15.0 10.0
    1 20.0 13.0 34.0 34.0 38.0 22.0 34.0 15.0
    2 30.0 45.0 45.0 28.0 26.0 22.0 19.0 27.0
    3 37.0 29.0 47.0 12.0 37.0 49.0 28.0 44.0
    4 30.0 29.0 29.0 34.0 40.0 41.0 20.0 20.0
    5 32.0 49.0 23.0 38.0 25.0 13.0 25.0 40.0
    6 23.0 46.0 31.0 22.0 15.0 37.0 21.0 21.0
    7 41.0 39.0 35.0 26.0 46.0 48.0 33.0 24.0

    method 控制填充的方式 bfill ffill

    ============================================

    练习7:

    1. 简述None与NaN的区别

    2. 假设张三李四参加模拟考试,但张三因为突然想明白人生放弃了英语考试,因此记为None,请据此创建一个DataFrame,命名为ddd3

    3. 老师决定根据用数学的分数填充张三的英语成绩,如何实现? 用李四的英语成绩填充张三的英语成绩?

    ============================================

    In [24]:
    data = pd.read_excel('测试数据.xlsx')
    data.drop(labels=['none1','none2'],axis=1,inplace=True)
    data
    
    Out[24]:
     time1234567
    0 2019-01-27 17:00:00 -24.8 -18.2 -20.8 -18.8 NaN NaN NaN
    1 2019-01-27 17:01:00 -23.5 -18.8 -20.5 -19.8 -15.2 -14.5 -16.0
    2 2019-01-27 17:02:00 -23.2 -19.2 NaN NaN -13.0 NaN -14.0
    3 2019-01-27 17:03:00 -22.8 -19.2 -20.0 -20.5 NaN -12.2 -9.8
    4 2019-01-27 17:04:00 -23.2 -18.5 -20.0 -18.8 -10.2 -10.8 -8.8
    5 2019-01-27 17:05:00 NaN NaN -19.0 -18.2 -10.0 -10.5 -10.8
    6 2019-01-27 17:06:00 NaN -18.5 -18.2 -17.5 NaN NaN NaN
    7 2019-01-27 17:07:00 -24.8 -18.0 -17.5 -17.2 -14.2 -14.0 -12.5
    8 2019-01-27 17:08:00 -25.2 -17.8 NaN NaN -16.2 NaN -14.5
    9 2019-01-27 17:09:00 -24.8 -18.2 NaN -17.5 NaN -15.5 -16.0
    10 2019-01-27 17:10:00 -24.5 -18.5 -16.0 -18.5 -17.5 -16.5 -17.2
    11 2019-01-27 17:11:00 NaN NaN -16.0 -18.5 -17.8 -16.8 -12.0
    12 2019-01-27 17:12:00 NaN -18.5 -15.8 -18.8 NaN NaN NaN
    13 2019-01-27 17:13:00 -23.8 -18.5 NaN NaN 4.5 NaN 0.0
    14 2019-01-27 17:14:00 -23.2 -18.2 NaN -19.0 NaN 5.8 6.8
    15 2019-01-27 17:15:00 -23.5 -17.8 -15.0 -18.0 10.5 10.5 10.8
    16 2019-01-27 17:16:00 NaN NaN -14.2 -17.2 14.0 13.5 13.0
    17 2019-01-27 17:17:00 NaN -18.2 -13.8 -17.8 15.8 15.2 14.2
    18 2019-01-27 17:18:00 -23.2 -19.0 -13.8 -18.2 NaN NaN NaN
    19 2019-01-27 17:19:00 -23.2 -19.5 NaN NaN 17.8 NaN 15.2
    20 2019-01-27 17:20:00 -23.2 -19.8 NaN -19.0 18.2 17.2 15.8
    21 2019-01-27 17:21:00 -23.5 -20.0 -13.8 -19.5 NaN 17.8 16.0
    22 2019-01-27 17:22:00 NaN NaN -14.0 -19.5 18.8 18.0 16.2
    23 2019-01-27 17:23:00 -23.2 -20.2 -14.0 -19.5 19.0 18.2 16.5
    24 2019-01-27 17:24:00 NaN -20.2 -14.2 -19.5 NaN NaN NaN
    25 2019-01-27 17:25:00 -22.8 -20.5 -14.5 -19.5 19.2 NaN 16.5
    26 2019-01-27 17:26:00 -22.8 -20.8 -15.0 -16.8 NaN 17.2 16.8
    27 2019-01-27 17:27:00 -22.0 -16.0 NaN -16.0 18.8 17.2 16.2
    28 2019-01-27 17:28:00 -22.8 -15.2 -14.8 -15.2 18.8 17.2 16.2
    29 2019-01-27 17:29:00 -22.5 -15.0 -14.8 -15.2 18.8 17.2 16.5
    ... ... ... ... ... ... ... ... ...
    1030 2019-01-28 10:10:00 -30.5 -27.5 -29.5 -27.8 -3.8 -3.5 -8.2
    1031 2019-01-28 10:11:00 -30.8 -27.0 -29.2 -27.8 -3.8 -3.2 -8.5
    1032 2019-01-28 10:12:00 -30.5 -26.2 -29.0 -26.8 -3.5 -3.0 -8.8
    1033 2019-01-28 10:13:00 -28.8 -25.2 -28.2 -26.2 -3.5 -3.0 -8.8
    1034 2019-01-28 10:14:00 -25.2 -25.2 -28.2 -25.8 -3.0 -2.5 -8.8
    1035 2019-01-28 10:15:00 -25.2 -25.8 -28.5 -26.2 -3.0 -2.2 -8.5
    1036 2019-01-28 10:16:00 -25.8 -26.2 -28.8 -26.8 -2.8 -2.0 -8.2
    1037 2019-01-28 10:17:00 -26.2 -26.8 -29.0 -27.2 -2.5 -1.8 -8.2
    1038 2019-01-28 10:18:00 -26.5 -27.0 -29.2 -27.5 NaN NaN NaN
    1039 2019-01-28 10:19:00 -27.0 -27.2 -29.5 -28.0 -2.2 -1.5 -7.8
    1040 2019-01-28 10:20:00 -26.5 -26.8 -29.0 -28.0 -2.2 -1.5 -7.5
    1041 2019-01-28 10:21:00 -25.0 -25.8 -28.5 -27.2 -2.2 -1.5 -7.5
    1042 2019-01-28 10:22:00 -24.0 -25.2 -28.2 -26.5 -2.0 -1.5 -7.2
    1043 2019-01-28 10:23:00 -23.8 -25.0 -28.0 -26.0 -2.0 -1.5 -7.2
    1044 2019-01-28 10:24:00 -24.0 -25.2 -28.0 -25.5 -2.0 -1.5 -7.0
    1045 2019-01-28 10:25:00 -25.0 -26.0 -28.2 -26.2 -2.2 -1.5 -6.8
    1046 2019-01-28 10:26:00 -25.8 -26.5 -28.8 -26.8 -2.2 -1.5 -6.5
    1047 2019-01-28 10:27:00 -26.2 -26.5 -28.8 -27.2 -2.2 -1.5 -6.5
    1048 2019-01-28 10:28:00 -25.0 -25.8 -28.5 -27.0 -2.2 -1.8 -6.2
    1049 2019-01-28 10:29:00 -24.8 -25.2 -28.0 -26.2 -2.2 -1.8 -6.0
    1050 2019-01-28 10:30:00 -24.5 -24.8 -27.8 -25.8 -2.0 -2.0 -6.0
    1051 2019-01-28 10:31:00 -24.0 -24.8 -27.8 -25.5 -2.0 -2.0 -5.8
    1052 2019-01-28 10:32:00 -24.2 -25.5 -28.0 -26.0 -2.0 -2.0 -5.5
    1053 2019-01-28 10:33:00 -25.0 -26.2 -28.2 -26.8 -2.0 -2.0 -5.2
    1054 2019-01-28 10:34:00 -25.8 -26.8 -28.5 -27.0 -2.0 -2.2 -5.2
    1055 2019-01-28 10:35:00 -26.2 -27.2 -28.8 -27.5 -2.0 NaN -5.0
    1056 2019-01-28 10:36:00 -26.8 -27.5 -29.0 -27.8 -2.2 NaN -5.0
    1057 2019-01-28 10:37:00 -27.2 -27.8 -29.0 -28.0 -2.2 NaN -5.0
    1058 2019-01-28 10:38:00 -27.5 -27.0 -29.0 -28.0 -3.5 -3.2 -5.8
    1059 2019-01-28 10:39:00 -27.0 -27.2 -29.0 -27.8 -5.0 NaN -7.0

    1060 rows × 8 columns

    In [27]:
    data.shape
    
    Out[27]:
    (1060, 8)
    In [26]:
    data.dropna(axis=0).shape
    
    Out[26]:
    (927, 8)
    In [35]:
    #检测哪些列中存在空值
    data.isnull().any(axis=0)
    
    Out[35]:
    time    False
    1       False
    2       False
    3       False
    4       False
    5       False
    6       False
    7       False
    dtype: bool
    In [34]:
    data.fillna(method='ffill',axis=0,inplace=True)
    data.fillna(method='bfill',axis=0,inplace=True)
  • 相关阅读:
    Github国内mirror加速
    通过node-inspector或VSCode调试服务器上代码
    node nvm 常见命令
    HBuilderX 修改默认的终端
    LLVM 工具使用
    LLVM 获取Value Type
    LLVM Constant Value to c++ value
    Bison 命名位置
    llvm block
    vscode use cl.exe build C/C++
  • 原文地址:https://www.cnblogs.com/bilx/p/11611830.html
Copyright © 2011-2022 走看看