  • Data Analysis and Mining Exercise

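    1.0 Background

    This dataset contains the drone delivery records of an Australian company (before August 2018). Its main columns are:

    • 'Id': record ID
    • 'Drone Type': drone category (type 1, 2, or 3)
    • 'Post Type': delivery category (0 = standard delivery, 1 = express delivery)
    • 'Package Weight': weight of the package
    • 'Origin Region': region code of the origin
    • 'Destination Region': region code of the destination
    • 'Origin Latitude': latitude of the origin
    • 'Origin Longitude': longitude of the origin
    • 'Destination Latitude': latitude of the destination
    • 'Destination Longitude': longitude of the destination
    • 'Journey Distance': delivery distance
    • 'Departure Date': departure date
    • 'Departure Time': departure time
    • 'Travel Time': flight time
    • 'Delivery Time': arrival time
    • 'Delivery Fare': delivery fee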
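    import pandas as pd  # needed here because this cell comes before the import cell in section 2.0
    pd.options.display.max_rows = 10  # limit printed DataFrames to 10 rows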
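    The three time columns appear to be linked: in the first records of the sample output, 'Delivery Time' equals 'Departure Time' plus 'Travel Time' when 'Travel Time' is read as minutes. That unit is an inference from the printed rows, not something the column description states. A minimal sketch of the check, using standard-library date arithmetic (the helper name `expected_delivery` is made up for illustration):

```python
from datetime import datetime, timedelta

def expected_delivery(departure_date, departure_time, travel_minutes):
    """Return the arrival time as H:MM:SS, assuming Travel Time is in minutes."""
    start = datetime.strptime(f"{departure_date} {departure_time}", "%Y-%m-%d %H:%M:%S")
    arrival = start + timedelta(minutes=travel_minutes)
    # The dataset prints hours without a leading zero (e.g. 4:51:07)
    return f"{arrival.hour}:{arrival.minute:02d}:{arrival.second:02d}"

# Values taken from the record with Id ID1645282128
arrival_text = expected_delivery("2018-01-16", "09:38:17", 140.19)
print(arrival_text)
```

    If the assumption holds, the printed value matches the 'Delivery Time' of that record; a mismatch on other records would point to a data-entry error in one of the three columns.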
    

    2.0 Loading packages and data

    # Load libraries
    import pandas as pd
    import re
    import matplotlib.pyplot as plt
    # import seaborn as sns  # optional; install first with: !pip install seaborn
    import scipy.stats as st
    import numpy as np
    import math
    from math import *
    from datetime import datetime, timedelta
    

    Task 1: Load the data file named 'data.csv'

    data = pd.read_csv('data.csv')
    

    pandas has two core data structures, DataFrame and Series. Check which one read_csv returned:

    type(data)
    
    pandas.core.frame.DataFrame
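    To make the DataFrame/Series distinction concrete, here is a small sketch that reads two records (copied from the sample output and reduced to three columns) from an in-memory string instead of 'data.csv'. The variable names are illustrative only.

```python
import io
import pandas as pd

# Two records copied from the sample output, reduced to three columns
csv_text = """Id,Package Weight,Delivery Fare
ID1645282128,21.686,99.25
ID1543933503,7.243,141.48
"""
sample = pd.read_csv(io.StringIO(csv_text))

table_type = type(sample).__name__                        # the whole table
column_type = type(sample["Package Weight"]).__name__     # a single column
subset_type = type(sample[["Package Weight"]]).__name__   # a list of columns
print(table_type, column_type, subset_type)
```

    Selecting with a single label gives a Series, while selecting with a list of labels keeps the result a DataFrame, even when the list holds one column.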
    

    3.0 Initial data exploration

    Task 2: Find the number of rows and columns

    data.shape
    
    (37903, 16)
    

    Task 3: View summary statistics for the columns

    Hint: describe()

    data.describe()
    
             Drone Type     Post Type  Package Weight  Origin Region
    count  37893.000000  37883.000000    37903.000000   37893.000000
    mean       1.699285      0.298709       25.669901      20.476684
    std        0.779845      0.457698       12.107150      11.501110
    min        1.000000      0.000000        5.001000       1.000000
    25%        1.000000      0.000000       15.199000      11.000000
    50%        2.000000      0.000000       25.446000      20.000000
    75%        2.000000      1.000000       35.953500      30.000000
    max        3.000000      1.000000       55.992000      40.000000

           Destination Region  Origin Latitude  Origin Longitude  Destination Latitude
    count        37893.000000     37903.000000      37903.000000          37903.000000
    mean            20.452722       -37.728867        145.423058            -37.722054
    std             11.509311         1.899183          6.923993              1.895621
    min              1.000000       -39.006941       -148.337157            -39.006941
    25%             11.000000       -38.443034        143.965002            -38.431293
    50%             20.000000       -37.707244        145.423386            -37.700695
    75%             30.000000       -37.094433        147.170334            -37.080256
    max             40.000000        38.986998        148.450576             38.989473

           Destination Longitude  Journey Distance   Travel Time  Delivery Fare
    count           37903.000000      37903.000000  37863.000000   37874.000000
    mean              145.434035        221.954150    208.794518     126.814976
    std                 6.909055        116.604355    107.612447      59.314445
    min              -147.691902          0.664000      7.420000      54.020000
    25%               143.951543        131.044500    125.165000      97.440000
    50%               145.450794        209.796000    196.370000     120.045000
    75%               147.216886        302.052000    281.250000     145.800000
    max               148.450576        556.637000    545.460000    1217.690000
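    Two properties of describe() explain the shape of this output: by default it summarizes only numeric columns, and its count row excludes missing values, which is why some counts fall below the 37903 rows of the table. A small sketch on a toy frame (a few values borrowed from the sample rows, with one gap inserted on purpose):

```python
import numpy as np
import pandas as pd

# Toy frame: 'Travel Time' has one missing value, 'Id' is text
toy = pd.DataFrame({
    "Id": ["a", "b", "c", "d"],
    "Travel Time": [140.19, 22.84, np.nan, 371.40],
    "Delivery Fare": [99.25, 149.04, 141.48, 122.82],
})

summary = toy.describe()
print(summary.loc["count"])    # non-missing values per numeric column
print(list(summary.columns))   # the text column 'Id' is left out
```

    Comparing the count row with the total number of rows is therefore a quick first hint at which numeric columns contain gaps.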

    Task 4: List the column names

    data.columns
    
    
    Index(['Id', 'Drone Type', 'Post Type', 'Package Weight', 'Origin Region',
           'Destination Region', 'Origin Latitude', 'Origin Longitude',
           'Destination Latitude', 'Destination Longitude', 'Journey Distance',
           'Departure Date', 'Departure Time', 'Travel Time', 'Delivery Time',
           'Delivery Fare'],
          dtype='object')
    

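    Task 5: Find the dtype of each column

    Is it object or float?

    data.info()
    # Caution: astype('str') also turns missing values (NaN) into the text 'nan',
    # so 'Drone Type' reports 37903 non-null entries in the second info() output below
    data['Drone Type'] = data['Drone Type'].astype('str')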
    
    
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 37903 entries, 0 to 37902
    Data columns (total 16 columns):
    Id                       37878 non-null object
    Drone Type               37893 non-null float64
    Post Type                37883 non-null float64
    Package Weight           37903 non-null float64
    Origin Region            37893 non-null float64
    Destination Region       37893 non-null float64
    Origin Latitude          37903 non-null float64
    Origin Longitude         37903 non-null float64
    Destination Latitude     37903 non-null float64
    Destination Longitude    37903 non-null float64
    Journey Distance         37903 non-null float64
    Departure Date           37903 non-null object
    Departure Time           37903 non-null object
    Travel Time              37863 non-null float64
    Delivery Time            37903 non-null object
    Delivery Fare            37874 non-null float64
    dtypes: float64(12), object(4)
    memory usage: 4.6+ MB
    
    data.info()
    
    
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 37903 entries, 0 to 37902
    Data columns (total 16 columns):
    Id                       37878 non-null object
    Drone Type               37903 non-null object
    Post Type                37883 non-null float64
    Package Weight           37903 non-null float64
    Origin Region            37893 non-null float64
    Destination Region       37893 non-null float64
    Origin Latitude          37903 non-null float64
    Origin Longitude         37903 non-null float64
    Destination Latitude     37903 non-null float64
    Destination Longitude    37903 non-null float64
    Journey Distance         37903 non-null float64
    Departure Date           37903 non-null object
    Departure Time           37903 non-null object
    Travel Time              37863 non-null float64
    Delivery Time            37903 non-null object
    Delivery Fare            37874 non-null float64
    dtypes: float64(11), object(5)
    memory usage: 4.6+ MB
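    In the run shown here, casting 'Drone Type' with astype('str') raised its non-null count from 37893 to 37903, because the ten NaN values became the text 'nan'. If the missing values should stay visible, one option is to convert only the real values. This is a sketch on a toy Series, not the approach used in the original notebook:

```python
import numpy as np
import pandas as pd

drone_type = pd.Series([2.0, np.nan, 2.0, 1.0], name="Drone Type")

# Convert only the real values to text and leave NaN as NaN
as_text = drone_type.map(lambda v: v if pd.isna(v) else str(v))

missing_after = int(as_text.isna().sum())
print(as_text.tolist(), missing_after)
```

    With this variant isnull() still reports the gap after the conversion, so later missing-value counts stay consistent.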
    

    Task 6: Show the first 5 and last 5 rows

    data.head()
    
    
    Id Drone Type Post Type Package Weight Origin Region Destination Region Origin Latitude Origin Longitude Destination Latitude Destination Longitude Journey Distance Departure Date Departure Time Travel Time Delivery Time Delivery Fare
    0 ID1645282128 2.0 0.0 21.686 19.0 38.0 -37.089338 144.429529 -37.639134 142.891391 149.212 2018-01-16 09:38:17 140.19 11:58:28 99.25
    1 ID1697620764 nan 0.0 39.075 15.0 15.0 -38.481935 146.009567 -38.585528 146.199827 20.185 2018-02-10 04:28:17 22.84 4:51:07 149.04
    2 ID1543933503 2.0 0.0 7.243 33.0 28.0 -38.754167 144.509664 -38.242224 147.855342 296.975 2018-05-05 01:38:03 272.52 6:10:34 141.48
    3 ID1756517608 2.0 0.0 13.383 10.0 38.0 -37.240526 147.568019 -37.687178 142.991188 407.396 2018-06-11 11:43:04 371.40 17:54:27 122.82
    4 ID1832325834 2.0 0.0 8.123 1.0 8.0 -38.143985 143.798292 -38.548315 144.769228 95.974 2018-03-16 14:50:25 92.51 16:22:55 111.97
    5 ID1802448576 2.0 0.0 32.859 2.0 28.0 -37.421211 148.044072 -38.159627 148.194048 83.250 2018-05-15 16:35:50 81.12 17:56:57 113.88
    6 ID1940231408 1.0 0.0 20.616 29.0 36.0 -37.173949 143.140662 -37.021605 145.197043 183.363 2018-04-01 19:31:12 184.22 22:35:25 85.60
    7 ID1299303958 2.0 0.0 44.577 36.0 31.0 -37.123190 145.236196 -37.667199 143.877650 134.543 2018-05-01 18:39:36 127.05 20:46:38 114.22
    8 ID1752722028 1.0 0.0 15.363 20.0 30.0 -38.850561 148.317253 -38.024914 144.823938 318.132 2018-05-27 14:48:17 314.64 20:02:55 87.39
    9 ID5995243590 1.0 1.0 36.190 18.0 28.0 -38.070189 142.950207 -37.996817 148.026520 445.106 2018-06-17 12:53:02 437.52 20:10:33 142.95
    data.tail()
    
    
    Id Drone Type Post Type Package Weight Origin Region Destination Region Origin Latitude Origin Longitude Destination Latitude Destination Longitude Journey Distance Departure Date Departure Time Travel Time Delivery Time Delivery Fare
    37898 NaN 3.0 1.0 27.153 39.0 16.0 -38.446310 148.292498 -36.739777 143.604529 454.968 2018-07-23 08:29:19 366.09 14:35:24 188.49
    37899 ID5862552991 1.0 1.0 40.363 9.0 38.0 -38.983710 145.320518 -37.673908 142.879230 258.259 2018-06-26 15:55:37 256.70 20:12:18 122.98
    37900 ID5339104082 1.0 1.0 35.955 13.0 32.0 -38.292301 147.562013 -36.605285 148.293183 198.597 2018-03-19 16:41:10 198.97 20:00:08 118.47
    37901 ID5468787866 2.0 1.0 29.566 33.0 23.0 -38.853243 144.508346 -37.727691 145.662270 160.816 2018-02-26 04:22:30 150.58 6:53:04 161.96
    37902 ID1448126768 3.0 0.0 44.070 36.0 34.0 -37.129313 145.266426 -38.428477 143.341632 222.687 2018-07-07 08:01:42 182.71 11:04:24 144.41

    Task 7: Count the missing values in every column and sort from most to fewest

    Hidden task: visualize the columns that have missing values

    data.isnull().sum().sort_values(ascending=False)
    
    
    Travel Time              40
    Delivery Fare            29
    Id                       25
    Post Type                20
    Destination Region       10
    Origin Region            10
    Delivery Time             0
    Departure Time            0
    Departure Date            0
    Journey Distance          0
    Destination Longitude     0
    Destination Latitude      0
    Origin Longitude          0
    Origin Latitude           0
    Package Weight            0
    Drone Type                0
    dtype: int64
    
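    # Same count with an explicit loop. This output was produced before 'Drone Type' was cast to str,
    # so its 10 missing values still show; after the cast they become the text 'nan' and count as 0
    count = {}
    for col in data.columns:
        count_null = data[col].isnull().sum()
        count[col] = count_null
    for i,j in sorted(count.items(),key = lambda s: s[1], reverse=True):
        print('Column: %s, missing values: %s'%(i,j))
    
    
    Column: Travel Time, missing values: 40
    Column: Delivery Fare, missing values: 29
    Column: Id, missing values: 25
    Column: Post Type, missing values: 20
    Column: Drone Type, missing values: 10
    Column: Origin Region, missing values: 10
    Column: Destination Region, missing values: 10
    Column: Package Weight, missing values: 0
    Column: Origin Latitude, missing values: 0
    Column: Origin Longitude, missing values: 0
    Column: Destination Latitude, missing values: 0
    Column: Destination Longitude, missing values: 0
    Column: Journey Distance, missing values: 0
    Column: Departure Date, missing values: 0
    Column: Departure Time, missing values: 0
    Column: Delivery Time, missing values: 0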
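    For the hidden task, a horizontal bar chart is one simple way to show which columns have gaps. The sketch below hard-codes the non-zero counts copied from the loop output above so that it runs without 'data.csv'; the figure size and labels are arbitrary choices.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line inside a notebook
import matplotlib.pyplot as plt

# Non-zero missing-value counts copied from the loop output above
missing = {
    "Travel Time": 40, "Delivery Fare": 29, "Id": 25, "Post Type": 20,
    "Drone Type": 10, "Origin Region": 10, "Destination Region": 10,
}

fig, ax = plt.subplots(figsize=(6, 3))
bars = ax.barh(list(missing.keys()), list(missing.values()))
ax.invert_yaxis()  # largest count on top
ax.set_xlabel("missing values")
ax.set_title("Columns with missing values")
fig.tight_layout()
```

    On the full table, the same counts could come from data.isnull().sum(), filtered to the entries greater than zero, instead of the hard-coded dictionary.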
    

    Task 8: Find all rows that contain at least one missing value, and count them

    data.isnull().any(axis=1)  # True for rows with at least one missing value
    
    
    0        False
    1         True
    2        False
    3        False
    4        False
             ...  
    37898     True
    37899    False
    37900    False
    37901    False
    37902    False
    Length: 37903, dtype: bool
    
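    # Side experiment, not part of Task 8: data.iloc[0, 2] is the 'Post Type' value of the first row (0.0),
    # so drop() removes the row labeled 0 and returns a copy; data itself is unchanged
    data.drop(data.iloc[0,2])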
    
    
    Id Drone Type Post Type Package Weight Origin Region Destination Region Origin Latitude Origin Longitude Destination Latitude Destination Longitude Journey Distance Departure Date Departure Time Travel Time Delivery Time Delivery Fare
    1 ID1697620764 NaN 0.0 39.075 15.0 15.0 -38.481935 146.009567 -38.585528 146.199827 20.185 2018-02-10 04:28:17 22.84 4:51:07 149.04
    2 ID1543933503 2.0 0.0 7.243 33.0 28.0 -38.754167 144.509664 -38.242224 147.855342 296.975 2018-05-05 01:38:03 272.52 6:10:34 141.48
    3 ID1756517608 2.0 0.0 13.383 10.0 38.0 -37.240526 147.568019 -37.687178 142.991188 407.396 2018-06-11 11:43:04 371.40 17:54:27 122.82
    4 ID1832325834 2.0 0.0 8.123 1.0 8.0 -38.143985 143.798292 -38.548315 144.769228 95.974 2018-03-16 14:50:25 92.51 16:22:55 111.97
    5 ID1802448576 2.0 0.0 32.859 2.0 28.0 -37.421211 148.044072 -38.159627 148.194048 83.250 2018-05-15 16:35:50 81.12 17:56:57 113.88
    6 ID1940231408 1.0 0.0 20.616 29.0 36.0 -37.173949 143.140662 -37.021605 145.197043 183.363 2018-04-01 19:31:12 184.22 22:35:25 85.60
    7 ID1299303958 2.0 0.0 44.577 36.0 31.0 -37.123190 145.236196 -37.667199 143.877650 134.543 2018-05-01 18:39:36 127.05 20:46:38 114.22
    8 ID1752722028 1.0 0.0 15.363 20.0 30.0 -38.850561 148.317253 -38.024914 144.823938 318.132 2018-05-27 14:48:17 314.64 20:02:55 87.39
    9 ID5995243590 1.0 1.0 36.190 18.0 28.0 -38.070189 142.950207 -37.996817 148.026520 445.106 2018-06-17 12:53:02 437.52 20:10:33 142.95
    10 ID1483358088 2.0 0.0 23.172 13.0 27.0 -38.225456 147.425515 -37.642798 147.124104 70.051 2018-03-19 09:59:10 69.30 11:08:27 96.95
    11 ID1626798395 2.0 0.0 19.754 23.0 26.0 -37.625368 145.838281 -36.789955 147.133916 147.791 2018-02-28 17:40:59 138.92 19:59:54 117.48
    12 ID5277549009 3.0 1.0 12.807 4.0 6.0 -36.855984 142.929596 -36.906838 145.696986 246.465 2018-03-26 07:55:48 201.49 11:17:17 173.32
    13 ID1950928883 2.0 0.0 22.332 33.0 19.0 -38.894115 144.457143 -37.173740 144.152105 193.365 2018-06-28 16:05:55 179.73 19:05:38 117.67
    14 ID5143738648 2.0 1.0 25.880 33.0 5.0 -38.872372 144.606034 -37.553304 145.120753 153.580 2018-05-09 08:33:29 144.10 10:57:34 132.98
    15 ID5132897910 2.0 1.0 38.691 7.0 15.0 -38.844622 144.093195 -38.476630 145.849992 158.102 2018-01-05 15:55:00 148.15 18:23:09 147.73
    16 ID1290889802 1.0 0.0 30.742 19.0 14.0 -37.178059 144.403991 -37.713867 146.382965 184.783 2018-06-05 13:03:43 185.60 16:09:18 84.82
    17 ID5226355535 1.0 1.0 18.055 10.0 18.0 -37.141728 147.256091 -37.983127 143.290272 362.227 2018-05-04 07:42:51 357.32 13:40:10 116.35
    18 ID1898978312 1.0 0.0 5.986 23.0 18.0 -37.706960 145.718119 -37.983127 143.182464 224.997 2018-05-23 12:53:12 224.51 16:37:42 82.47
    19 ID5284908619 2.0 1.0 30.664 13.0 14.0 -38.250238 147.366610 -37.774617 146.503568 92.371 2018-07-11 12:52:57 89.29 14:22:14 143.96
    20 ID1585556406 3.0 0.0 24.942 3.0 39.0 -38.322643 145.505910 -38.453191 148.300405 244.251 2018-03-14 22:16:33 199.74 1:36:17 167.45
    21 ID1901962779 1.0 0.0 34.012 27.0 37.0 -37.516090 146.969053 -38.852488 147.816987 166.236 2018-07-17 16:26:55 167.65 19:14:34 88.98
    22 ID5590279060 1.0 1.0 55.229 7.0 6.0 -38.876145 143.911302 -36.875012 145.759049 275.633 2018-05-01 12:42:19 273.52 17:15:50 688.24
    23 ID1473718059 3.0 0.0 40.741 33.0 17.0 -38.756634 144.375564 -38.817727 147.071495 234.014 2018-06-21 01:04:15 191.66 4:15:54 165.37
    24 ID5551646734 3.0 1.0 15.419 35.0 21.0 -36.922413 146.362349 -37.240137 143.768208 233.068 2018-01-05 07:19:48 190.91 10:30:42 174.52
    25 ID1772122934 2.0 0.0 10.663 29.0 1.0 -37.019910 142.798295 -38.358124 143.943955 179.929 2018-02-27 04:42:03 167.70 7:29:44 124.95
    26 ID1987608852 1.0 0.0 20.685 9.0 35.0 -38.970092 145.435801 -37.068624 146.317717 225.349 2018-05-15 15:35:10 224.85 19:20:00 80.66
    27 ID1249352358 2.0 0.0 10.272 8.0 34.0 -38.485641 144.522135 -38.505484 143.311788 105.471 2018-02-05 06:13:29 101.02 7:54:30 102.09
    28 ID1611614450 1.0 0.0 7.373 16.0 38.0 -36.600877 143.566811 -37.804113 142.793139 150.483 2018-06-18 14:01:42 152.40 16:34:06 82.11
    29 ID1262379299 2.0 0.0 31.358 13.0 16.0 -38.195911 147.436921 -36.734698 143.764279 362.941 2018-04-24 22:40:54 331.59 4:12:29 147.21
    30 ID5498216777 2.0 1.0 6.383 40.0 29.0 -37.692167 147.890721 -37.273398 142.951899 438.698 2018-05-09 10:31:10 399.43 17:10:35 152.50
    ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
    37873 NaN 1.0 0.0 39.038 7.0 22.0 -38.664403 143.702592 -36.670478 144.312065 228.361 2018-02-02 18:03:25 227.77 21:51:11 87.87
    37874 NaN 1.0 0.0 7.171 14.0 1.0 -37.589382 146.559422 -38.229669 143.710836 260.125 2018-02-03 22:11:31 258.51 2:30:01 106.47
    37875 NaN 1.0 0.0 39.141 5.0 1.0 -37.491746 145.168879 -38.168680 143.936761 131.959 2018-04-19 22:54:02 134.48 1:08:30 94.83
    37876 NaN 2.0 0.0 42.330 25.0 5.0 -36.618297 147.643165 -37.563856 145.038804 254.069 2018-01-27 11:06:09 234.09 15:00:14 116.10
    37877 NaN 1.0 1.0 43.965 25.0 38.0 -36.542082 147.863736 -37.549355 142.980000 448.104 2018-06-14 22:15:50 440.42 5:36:15 149.26
    37878 ID1525565031 2.0 0.0 27.560 7.0 35.0 -38.689638 144.093996 -37.046945 146.460389 276.894 2018-02-22 23:43:48 254.53 3:58:19 135.68
    37879 NaN 1.0 0.0 18.540 36.0 18.0 -36.974023 145.036046 -38.015640 142.955101 217.299 2018-05-28 07:14:50 217.06 10:51:53 76.31
    37880 ID1272392458 1.0 0.0 44.226 40.0 3.0 -37.740012 147.768225 -38.276119 145.403536 215.813 2018-02-21 21:28:22 215.63 1:03:59 105.89
    37881 ID1909131399 2.0 0.0 6.415 25.0 27.0 -36.585246 147.853062 -37.700285 147.027318 144.135 2018-01-16 08:26:33 135.64 10:42:11 100.45
    37882 ID5420355345 3.0 1.0 30.471 3.0 15.0 -38.345519 145.542105 -38.712933 145.977537 55.772 2018-07-22 00:13:54 50.94 1:04:50 183.73
    37883 ID5164753016 2.0 1.0 6.553 5.0 15.0 -37.625311 145.097611 -38.630132 146.040376 139.018 2018-01-14 16:51:56 131.06 19:02:59 145.71
    37884 ID1122103211 1.0 0.0 37.850 38.0 10.0 -37.510578 142.826103 -37.337327 147.315144 397.278 2018-04-27 14:54:35 391.24 21:25:49 102.88
    37885 NaN 1.0 0.0 13.544 34.0 38.0 -38.528632 143.386140 -37.561137 142.835312 118.028 2018-04-09 11:48:46 120.99 13:49:45 69.88
    37886 ID1970402579 2.0 0.0 38.656 1.0 32.0 -38.171542 143.871582 -36.756484 148.322553 423.585 2018-03-18 01:01:33 385.90 7:27:26 142.19
    37887 ID1388385049 1.0 0.0 23.699 10.0 10.0 -37.352794 147.476662 -37.104700 147.417027 28.118 2018-05-24 00:41:08 33.99 1:15:07 83.77
    37888 NaN 1.0 1.0 34.923 36.0 28.0 -37.044972 144.920592 -38.218799 148.050871 305.307 2018-07-20 03:13:54 302.23 8:16:07 130.70
    37889 ID1281653747 1.0 0.0 34.130 8.0 36.0 -38.434373 144.730220 -37.047801 145.309517 162.554 2018-02-06 06:27:55 164.08 9:11:59 69.44
    37890 ID5349085772 1.0 1.0 18.286 22.0 14.0 -36.720129 144.588398 -37.695893 146.430793 196.154 2018-02-15 17:04:44 196.60 20:21:19 121.11
    37891 ID5972337482 1.0 1.0 11.538 33.0 2.0 -38.836802 144.357057 -37.549931 148.306793 374.024 2018-05-26 02:07:50 368.73 8:16:33 141.06
    37892 NaN 3.0 1.0 5.416 20.0 33.0 -38.959090 148.294700 -38.930545 144.661887 314.514 2018-04-27 20:53:02 255.21 1:08:14 185.96
    37893 ID1539650034 2.0 0.0 34.355 8.0 39.0 -38.520278 144.408786 -38.447195 148.416066 349.251 2018-05-09 07:26:49 319.33 12:46:08 121.30
    37894 NaN 2.0 0.0 41.232 38.0 39.0 -37.657406 142.777301 -38.622040 148.366529 500.901 2018-07-02 08:59:29 455.14 16:34:37 139.79
    37895 ID1796943211 1.0 0.0 44.341 23.0 24.0 -37.777223 146.024184 -38.913981 142.913934 299.552 2018-02-20 05:08:12 296.66 10:04:51 113.70
    37896 ID5429883749 2.0 1.0 17.798 11.0 40.0 -38.045551 146.736254 -37.633007 147.639273 91.711 2018-05-03 10:19:32 88.70 11:48:14 130.51
    37897 NaN 1.0 0.0 8.865 9.0 2.0 -38.839254 145.226776 -37.695101 148.251214 293.394 2018-03-11 12:18:21 290.70 17:09:02 88.71
    37898 NaN 3.0 1.0 27.153 39.0 16.0 -38.446310 148.292498 -36.739777 143.604529 454.968 2018-07-23 08:29:19 366.09 14:35:24 188.49
    37899 ID5862552991 1.0 1.0 40.363 9.0 38.0 -38.983710 145.320518 -37.673908 142.879230 258.259 2018-06-26 15:55:37 256.70 20:12:18 122.98
    37900 ID5339104082 1.0 1.0 35.955 13.0 32.0 -38.292301 147.562013 -36.605285 148.293183 198.597 2018-03-19 16:41:10 198.97 20:00:08 118.47
    37901 ID5468787866 2.0 1.0 29.566 33.0 23.0 -38.853243 144.508346 -37.727691 145.662270 160.816 2018-02-26 04:22:30 150.58 6:53:04 161.96
    37902 ID1448126768 3.0 0.0 44.070 36.0 34.0 -37.129313 145.266426 -38.428477 143.341632 222.687 2018-07-07 08:01:42 182.71 11:04:24 144.41

    37902 rows × 16 columns

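    # axis=1 evaluates across the columns and gives one result per row; axis=0 gives one result per column
    data[data.isnull().any(axis=1)].shape  # (rows with at least one missing value, 16)
    # Equivalent count: (data.isnull().sum(axis=1) >= 1).sum()
    data.isnull().any(axis=1)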
    
    
    0        False
    1         True
    2        False
    3        False
    4        False
    5        False
    6        False
    7        False
    8        False
    9        False
    10       False
    11       False
    12       False
    13       False
    14       False
    15       False
    16       False
    17       False
    18       False
    19       False
    20       False
    21       False
    22       False
    23       False
    24       False
    25       False
    26       False
    27       False
    28       False
    29       False
             ...  
    37873     True
    37874     True
    37875     True
    37876     True
    37877     True
    37878    False
    37879     True
    37880    False
    37881    False
    37882    False
    37883    False
    37884    False
    37885     True
    37886    False
    37887    False
    37888     True
    37889    False
    37890    False
    37891    False
    37892     True
    37893    False
    37894     True
    37895    False
    37896    False
    37897     True
    37898     True
    37899    False
    37900    False
    37901    False
    37902    False
    dtype: bool
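    The boolean Series above can be used both to select the affected rows and to count them. A minimal sketch on a toy frame (the values are made up for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Id": ["ID1", None, "ID3", "ID4"],
    "Post Type": [0.0, 1.0, np.nan, 0.0],
    "Travel Time": [140.19, np.nan, 272.52, 371.40],
})

row_has_gap = toy.isnull().any(axis=1)     # one boolean per row
rows_with_gap = toy[row_has_gap]           # the rows themselves
n_rows_with_gap = int(row_has_gap.sum())   # True counts as 1
print(n_rows_with_gap)
```

    A row with several gaps is still counted once, which is the difference from summing the per-column missing counts.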
    

    4.0 Data cleaning

    Task 9: Fill the missing values in the 'Id' column

    Task 9.1: Count the missing values in the 'Id' column

    data['Id'].isnull().sum()
    
    
    25
    

    Task 9.2: Find all rows where 'Id' is missing

    data[data['Id'].isnull()]
    
    
    Id Drone Type Post Type Package Weight Origin Region Destination Region Origin Latitude Origin Longitude Destination Latitude Destination Longitude Journey Distance Departure Date Departure Time Travel Time Delivery Time Delivery Fare
    37844 NaN 1.0 0.0 22.498 30.0 15.0 -37.885792 144.875305 -38.680341 145.874064 124.252 2018-01-28 13:07:09 127.02 15:14:10 74.69
    37845 NaN 3.0 0.0 32.300 16.0 36.0 -36.571169 143.741010 -36.993435 144.983048 120.297 2018-03-26 02:34:49 101.88 4:16:41 162.20
    37846 NaN 3.0 0.0 18.601 38.0 29.0 -37.694132 142.851548 -37.058014 142.873698 70.838 2018-03-02 20:33:11 62.83 21:36:00 142.10
    37850 NaN 2.0 1.0 28.203 18.0 21.0 -38.139260 143.345778 -37.279348 143.604189 98.391 2018-04-17 03:28:41 94.68 5:03:21 152.30
    37851 NaN 2.0 1.0 45.696 28.0 1.0 -38.152835 147.793072 -38.162103 144.043047 328.220 2018-03-25 13:50:57 300.50 18:51:27 167.39
    37852 NaN 1.0 0.0 27.143 38.0 32.0 -37.613135 142.854194 -36.713765 148.383062 500.500 2018-05-12 08:54:52 491.13 17:05:59 97.44
    37854 NaN 1.0 0.0 13.002 27.0 3.0 -37.428594 147.056992 -38.383631 145.590528 167.005 2018-02-20 03:35:43 168.39 6:24:06 92.12
    37857 NaN 1.0 0.0 19.468 5.0 31.0 -37.570514 145.253281 -37.824303 143.862472 125.716 2018-04-20 11:48:14 128.43 13:56:39 69.86
    37860 NaN 1.0 0.0 10.647 21.0 5.0 -37.239046 143.524656 -37.546884 145.204154 152.434 2018-06-13 23:05:39 154.29 1:39:56 95.65
    37861 NaN 2.0 0.0 40.775 22.0 19.0 -36.678761 144.345069 -37.167099 144.257043 54.922 2018-03-27 04:31:43 55.75 5:27:28 120.70
    37863 NaN 1.0 0.0 35.410 9.0 16.0 -39.006941 145.406988 -36.525200 143.517906 322.399 2018-07-21 06:48:39 318.77 12:07:25 88.14
    37866 NaN 1.0 1.0 40.151 14.0 34.0 -37.644061 146.625820 -38.518928 143.369714 301.447 2018-04-14 07:29:13 298.50 12:27:43 118.89
    37869 NaN 3.0 1.0 44.559 33.0 10.0 -38.733722 144.474460 -37.220132 147.498884 314.323 2018-04-18 04:57:37 255.06 9:12:40 201.89
    37873 NaN 1.0 0.0 39.038 7.0 22.0 -38.664403 143.702592 -36.670478 144.312065 228.361 2018-02-02 18:03:25 227.77 21:51:11 87.87
    37874 NaN 1.0 0.0 7.171 14.0 1.0 -37.589382 146.559422 -38.229669 143.710836 260.125 2018-02-03 22:11:31 258.51 2:30:01 106.47
    37875 NaN 1.0 0.0 39.141 5.0 1.0 -37.491746 145.168879 -38.168680 143.936761 131.959 2018-04-19 22:54:02 134.48 1:08:30 94.83
    37876 NaN 2.0 0.0 42.330 25.0 5.0 -36.618297 147.643165 -37.563856 145.038804 254.069 2018-01-27 11:06:09 234.09 15:00:14 116.10
    37877 NaN 1.0 1.0 43.965 25.0 38.0 -36.542082 147.863736 -37.549355 142.980000 448.104 2018-06-14 22:15:50 440.42 5:36:15 149.26
    37879 NaN 1.0 0.0 18.540 36.0 18.0 -36.974023 145.036046 -38.015640 142.955101 217.299 2018-05-28 07:14:50 217.06 10:51:53 76.31
    37885 NaN 1.0 0.0 13.544 34.0 38.0 -38.528632 143.386140 -37.561137 142.835312 118.028 2018-04-09 11:48:46 120.99 13:49:45 69.88
    37888 NaN 1.0 1.0 34.923 36.0 28.0 -37.044972 144.920592 -38.218799 148.050871 305.307 2018-07-20 03:13:54 302.23 8:16:07 130.70
    37892 NaN 3.0 1.0 5.416 20.0 33.0 -38.959090 148.294700 -38.930545 144.661887 314.514 2018-04-27 20:53:02 255.21 1:08:14 185.96
    37894 NaN 2.0 0.0 41.232 38.0 39.0 -37.657406 142.777301 -38.622040 148.366529 500.901 2018-07-02 08:59:29 455.14 16:34:37 139.79
    37897 NaN 1.0 0.0 8.865 9.0 2.0 -38.839254 145.226776 -37.695101 148.251214 293.394 2018-03-11 12:18:21 290.70 17:09:02 88.71
    37898 NaN 3.0 1.0 27.153 39.0 16.0 -38.446310 148.292498 -36.739777 143.604529 454.968 2018-07-23 08:29:19 366.09 14:35:24 188.49

    9.2.1 Drop rows that are duplicates in every column except 'Id'

    data[
        ['Drone Type', 'Post Type', 'Package Weight', 'Origin Region',
           'Destination Region', 'Origin Latitude', 'Origin Longitude',
           'Destination Latitude', 'Destination Longitude', 'Journey Distance',
           'Departure Date', 'Departure Time', 'Travel Time', 'Delivery Time',
           'Delivery Fare']
    ]
    
    
    Drone Type Post Type Package Weight Origin Region Destination Region Origin Latitude Origin Longitude Destination Latitude Destination Longitude Journey Distance Departure Date Departure Time Travel Time Delivery Time Delivery Fare
    0 2.0 0.0 21.686 19.0 38.0 -37.089338 144.429529 -37.639134 142.891391 149.212 2018-01-16 09:38:17 140.19 11:58:28 99.25
    1 NaN 0.0 39.075 15.0 15.0 -38.481935 146.009567 -38.585528 146.199827 20.185 2018-02-10 04:28:17 22.84 4:51:07 149.04
    2 2.0 0.0 7.243 33.0 28.0 -38.754167 144.509664 -38.242224 147.855342 296.975 2018-05-05 01:38:03 272.52 6:10:34 141.48
    3 2.0 0.0 13.383 10.0 38.0 -37.240526 147.568019 -37.687178 142.991188 407.396 2018-06-11 11:43:04 371.40 17:54:27 122.82
    4 2.0 0.0 8.123 1.0 8.0 -38.143985 143.798292 -38.548315 144.769228 95.974 2018-03-16 14:50:25 92.51 16:22:55 111.97
    5 2.0 0.0 32.859 2.0 28.0 -37.421211 148.044072 -38.159627 148.194048 83.250 2018-05-15 16:35:50 81.12 17:56:57 113.88
    6 1.0 0.0 20.616 29.0 36.0 -37.173949 143.140662 -37.021605 145.197043 183.363 2018-04-01 19:31:12 184.22 22:35:25 85.60
    7 2.0 0.0 44.577 36.0 31.0 -37.123190 145.236196 -37.667199 143.877650 134.543 2018-05-01 18:39:36 127.05 20:46:38 114.22
    8 1.0 0.0 15.363 20.0 30.0 -38.850561 148.317253 -38.024914 144.823938 318.132 2018-05-27 14:48:17 314.64 20:02:55 87.39
    9 1.0 1.0 36.190 18.0 28.0 -38.070189 142.950207 -37.996817 148.026520 445.106 2018-06-17 12:53:02 437.52 20:10:33 142.95
    10 2.0 0.0 23.172 13.0 27.0 -38.225456 147.425515 -37.642798 147.124104 70.051 2018-03-19 09:59:10 69.30 11:08:27 96.95
    11 2.0 0.0 19.754 23.0 26.0 -37.625368 145.838281 -36.789955 147.133916 147.791 2018-02-28 17:40:59 138.92 19:59:54 117.48
    12 3.0 1.0 12.807 4.0 6.0 -36.855984 142.929596 -36.906838 145.696986 246.465 2018-03-26 07:55:48 201.49 11:17:17 173.32
    13 2.0 0.0 22.332 33.0 19.0 -38.894115 144.457143 -37.173740 144.152105 193.365 2018-06-28 16:05:55 179.73 19:05:38 117.67
    14 2.0 1.0 25.880 33.0 5.0 -38.872372 144.606034 -37.553304 145.120753 153.580 2018-05-09 08:33:29 144.10 10:57:34 132.98
    15 2.0 1.0 38.691 7.0 15.0 -38.844622 144.093195 -38.476630 145.849992 158.102 2018-01-05 15:55:00 148.15 18:23:09 147.73
    16 1.0 0.0 30.742 19.0 14.0 -37.178059 144.403991 -37.713867 146.382965 184.783 2018-06-05 13:03:43 185.60 16:09:18 84.82
    17 1.0 1.0 18.055 10.0 18.0 -37.141728 147.256091 -37.983127 143.290272 362.227 2018-05-04 07:42:51 357.32 13:40:10 116.35
    18 1.0 0.0 5.986 23.0 18.0 -37.706960 145.718119 -37.983127 143.182464 224.997 2018-05-23 12:53:12 224.51 16:37:42 82.47
    19 2.0 1.0 30.664 13.0 14.0 -38.250238 147.366610 -37.774617 146.503568 92.371 2018-07-11 12:52:57 89.29 14:22:14 143.96
    20 3.0 0.0 24.942 3.0 39.0 -38.322643 145.505910 -38.453191 148.300405 244.251 2018-03-14 22:16:33 199.74 1:36:17 167.45
    21 1.0 0.0 34.012 27.0 37.0 -37.516090 146.969053 -38.852488 147.816987 166.236 2018-07-17 16:26:55 167.65 19:14:34 88.98
    22 1.0 1.0 55.229 7.0 6.0 -38.876145 143.911302 -36.875012 145.759049 275.633 2018-05-01 12:42:19 273.52 17:15:50 688.24
    23 3.0 0.0 40.741 33.0 17.0 -38.756634 144.375564 -38.817727 147.071495 234.014 2018-06-21 01:04:15 191.66 4:15:54 165.37
    24 3.0 1.0 15.419 35.0 21.0 -36.922413 146.362349 -37.240137 143.768208 233.068 2018-01-05 07:19:48 190.91 10:30:42 174.52
    25 2.0 0.0 10.663 29.0 1.0 -37.019910 142.798295 -38.358124 143.943955 179.929 2018-02-27 04:42:03 167.70 7:29:44 124.95
    26 1.0 0.0 20.685 9.0 35.0 -38.970092 145.435801 -37.068624 146.317717 225.349 2018-05-15 15:35:10 224.85 19:20:00 80.66
    27 2.0 0.0 10.272 8.0 34.0 -38.485641 144.522135 -38.505484 143.311788 105.471 2018-02-05 06:13:29 101.02 7:54:30 102.09
    28 1.0 0.0 7.373 16.0 38.0 -36.600877 143.566811 -37.804113 142.793139 150.483 2018-06-18 14:01:42 152.40 16:34:06 82.11
    29 2.0 0.0 31.358 13.0 16.0 -38.195911 147.436921 -36.734698 143.764279 362.941 2018-04-24 22:40:54 331.59 4:12:29 147.21
    ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
    37873 1.0 0.0 39.038 7.0 22.0 -38.664403 143.702592 -36.670478 144.312065 228.361 2018-02-02 18:03:25 227.77 21:51:11 87.87
    37874 1.0 0.0 7.171 14.0 1.0 -37.589382 146.559422 -38.229669 143.710836 260.125 2018-02-03 22:11:31 258.51 2:30:01 106.47
    37875 1.0 0.0 39.141 5.0 1.0 -37.491746 145.168879 -38.168680 143.936761 131.959 2018-04-19 22:54:02 134.48 1:08:30 94.83
    37876 2.0 0.0 42.330 25.0 5.0 -36.618297 147.643165 -37.563856 145.038804 254.069 2018-01-27 11:06:09 234.09 15:00:14 116.10
    37877 1.0 1.0 43.965 25.0 38.0 -36.542082 147.863736 -37.549355 142.980000 448.104 2018-06-14 22:15:50 440.42 5:36:15 149.26
    37878 2.0 0.0 27.560 7.0 35.0 -38.689638 144.093996 -37.046945 146.460389 276.894 2018-02-22 23:43:48 254.53 3:58:19 135.68
    37879 1.0 0.0 18.540 36.0 18.0 -36.974023 145.036046 -38.015640 142.955101 217.299 2018-05-28 07:14:50 217.06 10:51:53 76.31
    37880 1.0 0.0 44.226 40.0 3.0 -37.740012 147.768225 -38.276119 145.403536 215.813 2018-02-21 21:28:22 215.63 1:03:59 105.89
    37881 2.0 0.0 6.415 25.0 27.0 -36.585246 147.853062 -37.700285 147.027318 144.135 2018-01-16 08:26:33 135.64 10:42:11 100.45
    37882 3.0 1.0 30.471 3.0 15.0 -38.345519 145.542105 -38.712933 145.977537 55.772 2018-07-22 00:13:54 50.94 1:04:50 183.73
    37883 2.0 1.0 6.553 5.0 15.0 -37.625311 145.097611 -38.630132 146.040376 139.018 2018-01-14 16:51:56 131.06 19:02:59 145.71
    37884 1.0 0.0 37.850 38.0 10.0 -37.510578 142.826103 -37.337327 147.315144 397.278 2018-04-27 14:54:35 391.24 21:25:49 102.88
    37885 1.0 0.0 13.544 34.0 38.0 -38.528632 143.386140 -37.561137 142.835312 118.028 2018-04-09 11:48:46 120.99 13:49:45 69.88
    37886 2.0 0.0 38.656 1.0 32.0 -38.171542 143.871582 -36.756484 148.322553 423.585 2018-03-18 01:01:33 385.90 7:27:26 142.19
    37887 1.0 0.0 23.699 10.0 10.0 -37.352794 147.476662 -37.104700 147.417027 28.118 2018-05-24 00:41:08 33.99 1:15:07 83.77
    37888 1.0 1.0 34.923 36.0 28.0 -37.044972 144.920592 -38.218799 148.050871 305.307 2018-07-20 03:13:54 302.23 8:16:07 130.70
    37889 1.0 0.0 34.130 8.0 36.0 -38.434373 144.730220 -37.047801 145.309517 162.554 2018-02-06 06:27:55 164.08 9:11:59 69.44
    37890 1.0 1.0 18.286 22.0 14.0 -36.720129 144.588398 -37.695893 146.430793 196.154 2018-02-15 17:04:44 196.60 20:21:19 121.11
    37891 1.0 1.0 11.538 33.0 2.0 -38.836802 144.357057 -37.549931 148.306793 374.024 2018-05-26 02:07:50 368.73 8:16:33 141.06
    37892 3.0 1.0 5.416 20.0 33.0 -38.959090 148.294700 -38.930545 144.661887 314.514 2018-04-27 20:53:02 255.21 1:08:14 185.96
    37893 2.0 0.0 34.355 8.0 39.0 -38.520278 144.408786 -38.447195 148.416066 349.251 2018-05-09 07:26:49 319.33 12:46:08 121.30
    37894 2.0 0.0 41.232 38.0 39.0 -37.657406 142.777301 -38.622040 148.366529 500.901 2018-07-02 08:59:29 455.14 16:34:37 139.79
    37895 1.0 0.0 44.341 23.0 24.0 -37.777223 146.024184 -38.913981 142.913934 299.552 2018-02-20 05:08:12 296.66 10:04:51 113.70
    37896 2.0 1.0 17.798 11.0 40.0 -38.045551 146.736254 -37.633007 147.639273 91.711 2018-05-03 10:19:32 88.70 11:48:14 130.51
    37897 1.0 0.0 8.865 9.0 2.0 -38.839254 145.226776 -37.695101 148.251214 293.394 2018-03-11 12:18:21 290.70 17:09:02 88.71
    37898 3.0 1.0 27.153 39.0 16.0 -38.446310 148.292498 -36.739777 143.604529 454.968 2018-07-23 08:29:19 366.09 14:35:24 188.49
    37899 1.0 1.0 40.363 9.0 38.0 -38.983710 145.320518 -37.673908 142.879230 258.259 2018-06-26 15:55:37 256.70 20:12:18 122.98
    37900 1.0 1.0 35.955 13.0 32.0 -38.292301 147.562013 -36.605285 148.293183 198.597 2018-03-19 16:41:10 198.97 20:00:08 118.47
    37901 2.0 1.0 29.566 33.0 23.0 -38.853243 144.508346 -37.727691 145.662270 160.816 2018-02-26 04:22:30 150.58 6:53:04 161.96
    37902 3.0 0.0 44.070 36.0 34.0 -37.129313 145.266426 -38.428477 143.341632 222.687 2018-07-07 08:01:42 182.71 11:04:24 144.41

    37903 rows × 15 columns
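    Before dropping anything, it can help to look at which rows repeat once 'Id' is ignored, since a row with a missing 'Id' may be a copy of a row that has one. The sketch below uses a toy frame: its last row is constructed for illustration and is not an actual record of 'data.csv'.

```python
import numpy as np
import pandas as pd

# Constructed for illustration: the last row repeats the first one except for Id
toy = pd.DataFrame({
    "Id": ["ID1645282128", "ID1543933503", np.nan],
    "Package Weight": [21.686, 7.243, 21.686],
    "Departure Date": ["2018-01-16", "2018-05-05", "2018-01-16"],
    "Delivery Fare": [99.25, 141.48, 99.25],
})

value_cols = list(toy.columns.drop("Id"))   # every column except Id
# keep=False marks every copy, so both members of a pair are shown
repeats = toy[toy.duplicated(subset=value_cols, keep=False)]
# drop_duplicates keeps the first copy by default
deduped = toy.drop_duplicates(subset=value_cols)
print(repeats)
print(deduped)
```

    Building the column list with columns.drop('Id') avoids typing all fifteen names by hand. Note that the default keep='first' keeps whichever copy appears earlier, so the row that survives has an 'Id' only if the copy with the 'Id' comes first.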

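    # drop_duplicates returns a new DataFrame with the duplicate rows removed (the first occurrence is kept);
    # duplicated() is the method that returns a boolean mask, with True marking the repeats
    # Version 1 (commented out): pass the column slice itself
    # data_1 = data.drop_duplicates(data[['Drone Type', 'Post Type', 'Package Weight', 'Origin Region',
    #        'Destination Region', 'Origin Latitude', 'Origin Longitude',
    #        'Destination Latitude', 'Destination Longitude', 'Journey Distance',
    #        'Departure Date', 'Departure Time', 'Travel Time', 'Delivery Time',
    #        'Delivery Fare']])
    # Version 2: pass the list of column names as the subset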
    data_1 = data.drop_duplicates(['Drone Type', 'Post Type', 'Package Weight', 'Origin Region',
        'Destination Region', 'Origin Latitude', 'Origin Longitude',
        'Destination Latitude', 'Destination Longitude', 'Journey Distance',
        'Departure Date', 'Departure Time', 'Travel Time', 'Delivery Time',
        'Delivery Fare'])
    data_1
    
    
    Id Drone Type Post Type Package Weight Origin Region Destination Region Origin Latitude Origin Longitude Destination Latitude Destination Longitude Journey Distance Departure Date Departure Time Travel Time Delivery Time Delivery Fare
    0 ID1645282128 2.0 0.0 21.686 19.0 38.0 -37.089338 144.429529 -37.639134 142.891391 149.212 2018-01-16 09:38:17 140.19 11:58:28 99.25
    1 ID1697620764 NaN 0.0 39.075 15.0 15.0 -38.481935 146.009567 -38.585528 146.199827 20.185 2018-02-10 04:28:17 22.84 4:51:07 149.04
    2 ID1543933503 2.0 0.0 7.243 33.0 28.0 -38.754167 144.509664 -38.242224 147.855342 296.975 2018-05-05 01:38:03 272.52 6:10:34 141.48
    3 ID1756517608 2.0 0.0 13.383 10.0 38.0 -37.240526 147.568019 -37.687178 142.991188 407.396 2018-06-11 11:43:04 371.40 17:54:27 122.82
    4 ID1832325834 2.0 0.0 8.123 1.0 8.0 -38.143985 143.798292 -38.548315 144.769228 95.974 2018-03-16 14:50:25 92.51 16:22:55 111.97
    ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
    37839 ID1879423081 2.0 0.0 17.081 20.0 14.0 -38.860358 148.174855 -37.562224 146.443991 209.278 2018-02-15 12:09:15 193.98 15:23:13 121.33
    37840 ID5705840841 1.0 1.0 35.164 29.0 23.0 -37.250331 142.837244 -37.739285 145.963420 281.403 2018-01-01 00:07:35 279.10 4:46:41 136.58
    37841 ID1276239209 3.0 0.0 36.704 12.0 11.0 -36.578568 145.273744 -38.305557 146.997297 245.269 2018-01-24 06:48:05 200.54 10:08:37 146.69
    37842 ID1432868583 3.0 0.0 13.195 37.0 21.0 -38.755802 147.744770 -37.487866 143.585293 390.599 2018-04-07 18:57:52 315.28 0:13:08 170.52
    37889 ID1281653747 1.0 0.0 34.130 8.0 36.0 -38.434373 144.730220 -37.047801 145.309517 162.554 2018-02-06 06:27:55 164.08 9:11:59 69.44

    37844 rows × 16 columns

    9.2.2 Set 'Id' to np.nan for rows whose Id repeats an earlier row but whose other columns do not

    # Returns a boolean Series marking the rows whose Id repeats an earlier row
    data_1['Id'].duplicated()
    
    
    0        False
    1        False
    2        False
    3        False
    4        False
             ...  
    37839    False
    37840    False
    37841    False
    37842    False
    37889     True
    Name: Id, Length: 37844, dtype: bool
    
    (data_1['Id'].duplicated()) & ~(data_1[['Drone Type', 'Post Type', 'Package Weight', 'Origin Region',
           'Destination Region', 'Origin Latitude', 'Origin Longitude',
           'Destination Latitude', 'Destination Longitude', 'Journey Distance',
           'Departure Date', 'Departure Time', 'Travel Time', 'Delivery Time',
           'Delivery Fare']].duplicated())
    
    
    0        False
    1        False
    2        False
    3        False
    4        False
             ...  
    37839    False
    37840    False
    37841    False
    37842    False
    37889     True
    Length: 37844, dtype: bool
    
    data_1.loc[(data_1['Id'].duplicated()) & ~(data_1[['Drone Type', 'Post Type', 'Package Weight', 'Origin Region',
           'Destination Region', 'Origin Latitude', 'Origin Longitude',
           'Destination Latitude', 'Destination Longitude', 'Journey Distance',
           'Departure Date', 'Departure Time', 'Travel Time', 'Delivery Time',
           'Delivery Fare']].duplicated()),'Id'] = np.nan
    
    
    data_1['Id'].value_counts()
    
    
    ID1858018960    1
    ID1792956453    1
    ID1574575344    1
    ID1146475421    1
    ID1897230811    1
                   ..
    ID5349642032    1
    ID5451065609    1
    ID5131869013    1
    ID1550669799    1
    ID5260585195    1
    Name: Id, Length: 37843, dtype: int64
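The same boolean condition is written out three times above. A minimal sketch of a more compact form follows, on a small made-up frame (the toy values and the names `toy`, `other_cols`, and `mask` are illustrative assumptions, not part of the dataset). The non-Id column list is derived with `columns.drop('Id')`, the mask is built once, and an explicit `.copy()` is taken, which typically keeps later column assignments from raising a SettingWithCopyWarning.

```python
import numpy as np
import pandas as pd

# Toy frame: rows 0 and 1 are full duplicates, row 3 reuses an Id with different data.
toy = pd.DataFrame({
    'Id': ['ID1', 'ID1', 'ID2', 'ID2'],
    'Post Type': [0.0, 0.0, 1.0, 0.0],
    'Package Weight': [5.0, 5.0, 7.5, 9.1],
})

# Every column except 'Id'.
other_cols = toy.columns.drop('Id').tolist()

# Drop rows whose non-Id columns repeat an earlier row; copy to get an independent frame.
toy_1 = toy.drop_duplicates(subset=other_cols).copy()

# Id repeats an earlier row, but the remaining columns do not.
mask = toy_1['Id'].duplicated() & ~toy_1[other_cols].duplicated()
toy_1.loc[mask, 'Id'] = np.nan
```

After the `drop_duplicates` step the second term of the mask is always true, so it is kept here only to mirror the original expression.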
    

    Task 9.3: Think of two ways to fill the missing values, and apply one of them

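    # Method 1: draw a random ID and test whether it is already in use
    import random
    # 'ID'+str(random.randint(1000000000,1999999999)) in data_1['Id'].values
    # Method 2: keep drawing random IDs until an unused one is found
    
    def fill_id(Id):
        existing = set(data_1['Id'].dropna())
        # the IDs of the Post Type 0 rows shown in this notebook start with 1
        Id_out = 'ID'+str(random.randint(1000000000,1999999999))
        while Id_out in existing:
            Id_out = 'ID'+str(random.randint(1000000000,1999999999))
        # return only after the loop, so the uniqueness check is actually applied
        return Id_out
    # Preview only: the result is displayed but not assigned back to data_1
    data_1.loc[(data_1['Post Type']==0)&(data_1['Id'].isnull()),'Id'].apply(fill_id)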
    
    
    37889    ID1117118084
    Name: Id, dtype: object
    
    data_1.loc[(data_1['Post Type']==0)&(data_1['Id'].isnull()),'Id']
    
    
    37889    NaN
    Name: Id, dtype: object
    
    data_1['Id']
    
    
    0        ID1645282128
    1        ID1697620764
    2        ID1543933503
    3        ID1756517608
    4        ID1832325834
                 ...     
    37839    ID1879423081
    37840    ID5705840841
    37841    ID1276239209
    37842    ID1432868583
    37889             NaN
    Name: Id, Length: 37844, dtype: object
    

    Task 9.4: Check whether 'Id' still has missing values

    data_1['Id'].isnull().sum()
    
    
    1
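The count is still 1 because the `apply` result above was only displayed and never written back to `data_1`. A hedged sketch of one way to generate unused IDs and assign them follows, on a made-up Series. The names `ids`, `new_unique_id`, and the seeded `rng` are illustrative, and the 10-digit range starting with 1 is an assumption based on the IDs of the Post Type 0 rows shown above.

```python
import random

import numpy as np
import pandas as pd

ids = pd.Series(['ID1645282128', np.nan, 'ID1543933503', np.nan], name='Id')

def new_unique_id(used, rng):
    """Draw 'ID' + a 10-digit number starting with 1 until it is not in `used`."""
    while True:
        candidate = 'ID' + str(rng.randint(1000000000, 1999999999))
        if candidate not in used:
            used.add(candidate)  # reserve it so two filled rows cannot collide
            return candidate

rng = random.Random(0)           # seeded only to make the sketch repeatable
used = set(ids.dropna())
missing = ids.isnull()

# Generate one ID per missing row and write the values back in place.
ids.loc[missing] = [new_unique_id(used, rng) for _ in range(missing.sum())]
```

Keeping the used IDs in a set makes each membership test cheap, whereas rebuilding a list from the column on every call scans all rows.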
    

    Task 10: Find all duplicated Ids

    data[data.Id.duplicated()].Id
    
    
    37843    ID1874340610
    37845             NaN
    37846             NaN
    37847    ID5156350605
    37848    ID1176413101
                 ...     
    37898             NaN
    37899    ID5862552991
    37900    ID5339104082
    37901    ID5468787866
    37902    ID1448126768
    Name: Id, Length: 59, dtype: object
    
    data['Id'].value_counts() >= 2
    
    
    ID5281864060     True
    ID1796943211     True
    ID1877344172     True
    ID5122284320     True
    ID1238297934     True
                    ...  
    ID5672029782    False
    ID1114364309    False
    ID5495523518    False
    ID1874532678    False
    ID5260585195    False
    Name: Id, Length: 37843, dtype: bool
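With its default `keep='first'`, `duplicated()` flags only the second and later occurrences, and it treats repeated `NaN` values as duplicates of each other, which is why `NaN` appears in the list above. A small sketch, on invented IDs, of listing every row that shares a real Id and of extracting the repeated Ids themselves (the frame `toy` and the names `is_dup`, `dup_rows`, and `dup_ids` are assumptions for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'Id': ['ID1', 'ID2', 'ID1', np.nan, np.nan, 'ID3']})

# keep=False marks every member of a duplicated group, not just the repeats;
# notna() leaves the missing Ids out of the result.
is_dup = toy['Id'].notna() & toy['Id'].duplicated(keep=False)
dup_rows = toy[is_dup]

# The distinct Ids that occur at least twice (value_counts skips NaN by default).
counts = toy['Id'].value_counts()
dup_ids = counts[counts >= 2].index.tolist()
```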
    

    Task 11: Drop duplicate rows

    data.duplicated()
    
    
    0        False
    1        False
    2        False
    3        False
    4        False
    5        False
    6        False
    7        False
    8        False
    9        False
    10       False
    11       False
    12       False
    13       False
    14       False
    15       False
    16       False
    17       False
    18       False
    19       False
    20       False
    21       False
    22       False
    23       False
    24       False
    25       False
    26       False
    27       False
    28       False
    29       False
             ...  
    37873    False
    37874    False
    37875    False
    37876    False
    37877    False
    37878     True
    37879    False
    37880     True
    37881     True
    37882     True
    37883     True
    37884     True
    37885    False
    37886     True
    37887     True
    37888    False
    37889    False
    37890     True
    37891     True
    37892    False
    37893     True
    37894    False
    37895     True
    37896     True
    37897    False
    37898    False
    37899     True
    37900     True
    37901     True
    37902     True
    dtype: bool
    
    data.drop_duplicates()
    
    
    Id Drone Type Post Type Package Weight Origin Region Destination Region Origin Latitude Origin Longitude Destination Latitude Destination Longitude Journey Distance Departure Date Departure Time Travel Time Delivery Time Delivery Fare
    0 ID1645282128 2.0 0.0 21.686 19.0 38.0 -37.089338 144.429529 -37.639134 142.891391 149.212 2018-01-16 09:38:17 140.19 11:58:28 99.25
    1 ID1697620764 NaN 0.0 39.075 15.0 15.0 -38.481935 146.009567 -38.585528 146.199827 20.185 2018-02-10 04:28:17 22.84 4:51:07 149.04
    2 ID1543933503 2.0 0.0 7.243 33.0 28.0 -38.754167 144.509664 -38.242224 147.855342 296.975 2018-05-05 01:38:03 272.52 6:10:34 141.48
    3 ID1756517608 2.0 0.0 13.383 10.0 38.0 -37.240526 147.568019 -37.687178 142.991188 407.396 2018-06-11 11:43:04 371.40 17:54:27 122.82
    4 ID1832325834 2.0 0.0 8.123 1.0 8.0 -38.143985 143.798292 -38.548315 144.769228 95.974 2018-03-16 14:50:25 92.51 16:22:55 111.97
    5 ID1802448576 2.0 0.0 32.859 2.0 28.0 -37.421211 148.044072 -38.159627 148.194048 83.250 2018-05-15 16:35:50 81.12 17:56:57 113.88
    6 ID1940231408 1.0 0.0 20.616 29.0 36.0 -37.173949 143.140662 -37.021605 145.197043 183.363 2018-04-01 19:31:12 184.22 22:35:25 85.60
    7 ID1299303958 2.0 0.0 44.577 36.0 31.0 -37.123190 145.236196 -37.667199 143.877650 134.543 2018-05-01 18:39:36 127.05 20:46:38 114.22
    8 ID1752722028 1.0 0.0 15.363 20.0 30.0 -38.850561 148.317253 -38.024914 144.823938 318.132 2018-05-27 14:48:17 314.64 20:02:55 87.39
    9 ID5995243590 1.0 1.0 36.190 18.0 28.0 -38.070189 142.950207 -37.996817 148.026520 445.106 2018-06-17 12:53:02 437.52 20:10:33 142.95
    10 ID1483358088 2.0 0.0 23.172 13.0 27.0 -38.225456 147.425515 -37.642798 147.124104 70.051 2018-03-19 09:59:10 69.30 11:08:27 96.95
    11 ID1626798395 2.0 0.0 19.754 23.0 26.0 -37.625368 145.838281 -36.789955 147.133916 147.791 2018-02-28 17:40:59 138.92 19:59:54 117.48
    12 ID5277549009 3.0 1.0 12.807 4.0 6.0 -36.855984 142.929596 -36.906838 145.696986 246.465 2018-03-26 07:55:48 201.49 11:17:17 173.32
    13 ID1950928883 2.0 0.0 22.332 33.0 19.0 -38.894115 144.457143 -37.173740 144.152105 193.365 2018-06-28 16:05:55 179.73 19:05:38 117.67
    14 ID5143738648 2.0 1.0 25.880 33.0 5.0 -38.872372 144.606034 -37.553304 145.120753 153.580 2018-05-09 08:33:29 144.10 10:57:34 132.98
    15 ID5132897910 2.0 1.0 38.691 7.0 15.0 -38.844622 144.093195 -38.476630 145.849992 158.102 2018-01-05 15:55:00 148.15 18:23:09 147.73
    16 ID1290889802 1.0 0.0 30.742 19.0 14.0 -37.178059 144.403991 -37.713867 146.382965 184.783 2018-06-05 13:03:43 185.60 16:09:18 84.82
    17 ID5226355535 1.0 1.0 18.055 10.0 18.0 -37.141728 147.256091 -37.983127 143.290272 362.227 2018-05-04 07:42:51 357.32 13:40:10 116.35
    18 ID1898978312 1.0 0.0 5.986 23.0 18.0 -37.706960 145.718119 -37.983127 143.182464 224.997 2018-05-23 12:53:12 224.51 16:37:42 82.47
    19 ID5284908619 2.0 1.0 30.664 13.0 14.0 -38.250238 147.366610 -37.774617 146.503568 92.371 2018-07-11 12:52:57 89.29 14:22:14 143.96
    20 ID1585556406 3.0 0.0 24.942 3.0 39.0 -38.322643 145.505910 -38.453191 148.300405 244.251 2018-03-14 22:16:33 199.74 1:36:17 167.45
    21 ID1901962779 1.0 0.0 34.012 27.0 37.0 -37.516090 146.969053 -38.852488 147.816987 166.236 2018-07-17 16:26:55 167.65 19:14:34 88.98
    22 ID5590279060 1.0 1.0 55.229 7.0 6.0 -38.876145 143.911302 -36.875012 145.759049 275.633 2018-05-01 12:42:19 273.52 17:15:50 688.24
    23 ID1473718059 3.0 0.0 40.741 33.0 17.0 -38.756634 144.375564 -38.817727 147.071495 234.014 2018-06-21 01:04:15 191.66 4:15:54 165.37
    24 ID5551646734 3.0 1.0 15.419 35.0 21.0 -36.922413 146.362349 -37.240137 143.768208 233.068 2018-01-05 07:19:48 190.91 10:30:42 174.52
    25 ID1772122934 2.0 0.0 10.663 29.0 1.0 -37.019910 142.798295 -38.358124 143.943955 179.929 2018-02-27 04:42:03 167.70 7:29:44 124.95
    26 ID1987608852 1.0 0.0 20.685 9.0 35.0 -38.970092 145.435801 -37.068624 146.317717 225.349 2018-05-15 15:35:10 224.85 19:20:00 80.66
    27 ID1249352358 2.0 0.0 10.272 8.0 34.0 -38.485641 144.522135 -38.505484 143.311788 105.471 2018-02-05 06:13:29 101.02 7:54:30 102.09
    28 ID1611614450 1.0 0.0 7.373 16.0 38.0 -36.600877 143.566811 -37.804113 142.793139 150.483 2018-06-18 14:01:42 152.40 16:34:06 82.11
    29 ID1262379299 2.0 0.0 31.358 13.0 16.0 -38.195911 147.436921 -36.734698 143.764279 362.941 2018-04-24 22:40:54 331.59 4:12:29 147.21
    ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
    37839 ID1879423081 2.0 0.0 17.081 20.0 14.0 -38.860358 148.174855 -37.562224 146.443991 209.278 2018-02-15 12:09:15 193.98 15:23:13 121.33
    37840 ID5705840841 1.0 1.0 35.164 29.0 23.0 -37.250331 142.837244 -37.739285 145.963420 281.403 2018-01-01 00:07:35 279.10 4:46:41 136.58
    37841 ID1276239209 3.0 0.0 36.704 12.0 11.0 -36.578568 145.273744 -38.305557 146.997297 245.269 2018-01-24 06:48:05 200.54 10:08:37 146.69
    37842 ID1432868583 3.0 0.0 13.195 37.0 21.0 -38.755802 147.744770 -37.487866 143.585293 390.599 2018-04-07 18:57:52 315.28 0:13:08 170.52
    37844 NaN 1.0 0.0 22.498 30.0 15.0 -37.885792 144.875305 -38.680341 145.874064 124.252 2018-01-28 13:07:09 127.02 15:14:10 74.69
    37845 NaN 3.0 0.0 32.300 16.0 36.0 -36.571169 143.741010 -36.993435 144.983048 120.297 2018-03-26 02:34:49 101.88 4:16:41 162.20
    37846 NaN 3.0 0.0 18.601 38.0 29.0 -37.694132 142.851548 -37.058014 142.873698 70.838 2018-03-02 20:33:11 62.83 21:36:00 142.10
    37850 NaN 2.0 1.0 28.203 18.0 21.0 -38.139260 143.345778 -37.279348 143.604189 98.391 2018-04-17 03:28:41 94.68 5:03:21 152.30
    37851 NaN 2.0 1.0 45.696 28.0 1.0 -38.152835 147.793072 -38.162103 144.043047 328.220 2018-03-25 13:50:57 300.50 18:51:27 167.39
    37852 NaN 1.0 0.0 27.143 38.0 32.0 -37.613135 142.854194 -36.713765 148.383062 500.500 2018-05-12 08:54:52 491.13 17:05:59 97.44
    37854 NaN 1.0 0.0 13.002 27.0 3.0 -37.428594 147.056992 -38.383631 145.590528 167.005 2018-02-20 03:35:43 168.39 6:24:06 92.12
    37857 NaN 1.0 0.0 19.468 5.0 31.0 -37.570514 145.253281 -37.824303 143.862472 125.716 2018-04-20 11:48:14 128.43 13:56:39 69.86
    37860 NaN 1.0 0.0 10.647 21.0 5.0 -37.239046 143.524656 -37.546884 145.204154 152.434 2018-06-13 23:05:39 154.29 1:39:56 95.65
    37861 NaN 2.0 0.0 40.775 22.0 19.0 -36.678761 144.345069 -37.167099 144.257043 54.922 2018-03-27 04:31:43 55.75 5:27:28 120.70
    37863 NaN 1.0 0.0 35.410 9.0 16.0 -39.006941 145.406988 -36.525200 143.517906 322.399 2018-07-21 06:48:39 318.77 12:07:25 88.14
    37866 NaN 1.0 1.0 40.151 14.0 34.0 -37.644061 146.625820 -38.518928 143.369714 301.447 2018-04-14 07:29:13 298.50 12:27:43 118.89
    37869 NaN 3.0 1.0 44.559 33.0 10.0 -38.733722 144.474460 -37.220132 147.498884 314.323 2018-04-18 04:57:37 255.06 9:12:40 201.89
    37873 NaN 1.0 0.0 39.038 7.0 22.0 -38.664403 143.702592 -36.670478 144.312065 228.361 2018-02-02 18:03:25 227.77 21:51:11 87.87
    37874 NaN 1.0 0.0 7.171 14.0 1.0 -37.589382 146.559422 -38.229669 143.710836 260.125 2018-02-03 22:11:31 258.51 2:30:01 106.47
    37875 NaN 1.0 0.0 39.141 5.0 1.0 -37.491746 145.168879 -38.168680 143.936761 131.959 2018-04-19 22:54:02 134.48 1:08:30 94.83
    37876 NaN 2.0 0.0 42.330 25.0 5.0 -36.618297 147.643165 -37.563856 145.038804 254.069 2018-01-27 11:06:09 234.09 15:00:14 116.10
    37877 NaN 1.0 1.0 43.965 25.0 38.0 -36.542082 147.863736 -37.549355 142.980000 448.104 2018-06-14 22:15:50 440.42 5:36:15 149.26
    37879 NaN 1.0 0.0 18.540 36.0 18.0 -36.974023 145.036046 -38.015640 142.955101 217.299 2018-05-28 07:14:50 217.06 10:51:53 76.31
    37885 NaN 1.0 0.0 13.544 34.0 38.0 -38.528632 143.386140 -37.561137 142.835312 118.028 2018-04-09 11:48:46 120.99 13:49:45 69.88
    37888 NaN 1.0 1.0 34.923 36.0 28.0 -37.044972 144.920592 -38.218799 148.050871 305.307 2018-07-20 03:13:54 302.23 8:16:07 130.70
    37889 ID1281653747 1.0 0.0 34.130 8.0 36.0 -38.434373 144.730220 -37.047801 145.309517 162.554 2018-02-06 06:27:55 164.08 9:11:59 69.44
    37892 NaN 3.0 1.0 5.416 20.0 33.0 -38.959090 148.294700 -38.930545 144.661887 314.514 2018-04-27 20:53:02 255.21 1:08:14 185.96
    37894 NaN 2.0 0.0 41.232 38.0 39.0 -37.657406 142.777301 -38.622040 148.366529 500.901 2018-07-02 08:59:29 455.14 16:34:37 139.79
    37897 NaN 1.0 0.0 8.865 9.0 2.0 -38.839254 145.226776 -37.695101 148.251214 293.394 2018-03-11 12:18:21 290.70 17:09:02 88.71
    37898 NaN 3.0 1.0 27.153 39.0 16.0 -38.446310 148.292498 -36.739777 143.604529 454.968 2018-07-23 08:29:19 366.09 14:35:24 188.49

    37869 rows × 16 columns

    Task 12: Fill the missing values in 'Post Type'

    Hint: it may be related to the Id

    data_1[data_1['Post Type'].isnull()]
    
    
    Id Drone Type Post Type Package Weight Origin Region Destination Region Origin Latitude Origin Longitude Destination Latitude Destination Longitude Journey Distance Departure Date Departure Time Travel Time Delivery Time Delivery Fare
    2179 ID5742860733 1.0 NaN 27.126 22.0 34.0 -36.679647 144.463364 -38.469625 143.202877 228.181 2018-02-28 14:01:50 227.59 17:49:25 114.27
    4212 ID1710452907 2.0 NaN 37.466 22.0 21.0 -36.695380 144.474875 -37.357030 143.508873 113.113 2018-07-28 11:30:33 107.86 13:18:24 103.59
    4219 ID1787036788 1.0 NaN 30.877 3.0 6.0 -38.311620 145.478355 -36.832092 145.600747 165.050 2018-07-18 11:58:01 166.50 14:44:31 75.01
    4253 ID1680028038 2.0 NaN 13.860 22.0 19.0 -36.837363 144.515369 -37.089526 144.531402 28.106 2018-06-17 04:43:00 31.74 5:14:44 120.43
    6299 ID1377767619 1.0 NaN 45.416 23.0 13.0 -37.720326 146.008038 -38.222304 147.498125 142.197 2018-01-13 09:36:02 144.38 12:00:24 69.85
    ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
    26778 ID1571528676 1.0 NaN 15.062 33.0 9.0 -38.965718 144.650812 -38.962812 145.292314 55.525 2018-04-28 21:13:25 60.51 22:13:55 89.97
    28691 ID5883461461 1.0 NaN 40.325 4.0 29.0 -36.542055 143.024959 -37.232887 142.980514 77.003 2018-05-01 04:18:09 81.29 5:39:26 124.92
    30861 ID1971183564 3.0 NaN 42.700 19.0 4.0 -37.120747 144.287009 -36.762751 143.078632 114.656 2018-02-25 16:19:23 97.43 17:56:48 144.33
    32921 ID5291598933 2.0 NaN 27.189 2.0 7.0 -37.564903 148.097962 -38.797131 143.873724 394.211 2018-01-09 12:39:15 359.59 18:38:50 165.54
    36991 ID1484986030 1.0 NaN 27.517 37.0 36.0 -38.938504 147.755657 -37.060130 144.993660 320.003 2018-01-07 20:24:23 316.45 1:40:49 99.58

    20 rows × 16 columns

    # Try to distinguish the post types by delivery speed (distance / travel time)
    def speed(df):
        speed = df['Journey Distance']/df['Travel Time']
        return round(speed,2)
    
    data_1.loc[data_1['Post Type']==0].apply(speed,axis=1)
    
    
    0        1.06
    1        0.88
    2        1.09
    3        1.10
    4        1.04
             ... 
    37837    0.99
    37839    1.08
    37841    1.22
    37842    1.24
    37889    0.99
    Length: 26529, dtype: float64
    
    # Try to distinguish the post types by price (fare / package weight)
    def price(df):
        price = df['Delivery Fare']/df['Package Weight']
        return round(price,2)
    data_1.loc[data_1['Post Type']==0,['Delivery Fare','Package Weight']].apply(price,axis=1)
    
    
    0         4.58
    1         3.81
    2        19.53
    3         9.18
    4        13.78
             ...  
    37837    13.20
    37839     7.10
    37841     4.00
    37842    12.92
    37889     2.03
    Length: 26529, dtype: float64
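Both ratios can be computed without a row-wise `apply`, and grouping them by the known 'Post Type' gives a quick view of whether either one separates the two classes. The sketch below uses four rows copied from the outputs above; the names `sample`, `speed`, `price_per_weight`, and `summary` are illustrative.

```python
import pandas as pd

sample = pd.DataFrame({
    'Post Type': [0.0, 0.0, 1.0, 1.0],
    'Package Weight': [21.686, 8.123, 36.190, 12.807],
    'Journey Distance': [149.212, 95.974, 445.106, 246.465],
    'Travel Time': [140.19, 92.51, 437.52, 201.49],
    'Delivery Fare': [99.25, 111.97, 142.95, 173.32],
})

# Column arithmetic is vectorised, so no apply(axis=1) is needed.
speed = (sample['Journey Distance'] / sample['Travel Time']).round(2)
price_per_weight = (sample['Delivery Fare'] / sample['Package Weight']).round(2)

# Mean of each ratio per known class; similar means suggest the ratio does not separate them.
summary = (pd.DataFrame({'speed': speed, 'price': price_per_weight})
           .groupby(sample['Post Type'])
           .mean())
```

On the full data, replacing `.mean()` with `.describe()` would show the whole distribution per class rather than a single number.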
    
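    # Distinguish by whether the Id starts with 5 or 1: in the rows shown above,
    # Ids starting with 5 have Post Type 1 and Ids starting with 1 have Post Type 0
    data_1.loc[data_1['Post Type'].isnull(),'Post Type'] = data_1.loc[data_1['Post Type'].isnull(),'Id'].apply(lambda s:float(s[2]=='5'))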
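In the rows shown earlier, Ids that start with 5 carry Post Type 1 and Ids that start with 1 carry Post Type 0. Before filling from the Id, that rule can be checked against the rows whose 'Post Type' is known. The sketch below uses six Ids taken from the outputs above; the names `sample`, `first_digit`, and `check` are illustrative.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    'Id': ['ID1645282128', 'ID5995243590', 'ID5277549009', 'ID1483358088',
           'ID5742860733', 'ID1710452907'],
    'Post Type': [0.0, 1.0, 1.0, 0.0, np.nan, np.nan],
})

# First digit after the 'ID' prefix ('1' or '5' in these rows).
first_digit = sample['Id'].str[2]

# Cross-tabulate the digit against the known labels to check the rule;
# rows with a missing label are left out of the table.
check = pd.crosstab(first_digit, sample['Post Type'])

# Fill only the missing labels: '5' -> 1.0 (express), otherwise 0.0 (normal).
missing = sample['Post Type'].isnull()
sample.loc[missing, 'Post Type'] = (first_digit[missing] == '5').astype(float)
```

If the rule holds, `check` has non-zero counts only in the ('1', 0.0) and ('5', 1.0) cells.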
    
    

    Task 13: Fix the erroneous values in the 'Origin Longitude' and 'Origin Latitude' columns

    data_1['Origin Longitude'].describe()
    # There are negative values
    
    
    count    37844.000000
    mean       145.423081
    std          6.929107
    min       -148.337157
    25%        143.964265
    50%        145.424189
    75%        147.171954
    max        148.450576
    Name: Origin Longitude, dtype: float64
    
    def fix_Longitude_Latitude(flt):
        if flt<0:
            return -flt
        else:
            return flt
    data_1['Origin Longitude'].apply(fix_Longitude_Latitude).describe()
    
    
    count    37844.000000
    mean       145.577375
    std          1.764044
    min        142.769991
    25%        143.966143
    50%        145.426161
    75%        147.172965
    max        148.450576
    Name: Origin Longitude, dtype: float64
    
    data_1['Origin Longitude'] = data_1['Origin Longitude'].apply(fix_Longitude_Latitude)
    
    
    /anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
    A value is trying to be set on a copy of a slice from a DataFrame.
    Try using .loc[row_indexer,col_indexer] = value instead
    
    See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
      """Entry point for launching an IPython kernel.
    
    data_1['Origin Latitude'].describe()
    
    
    count    37844.000000
    mean       -37.728738
    std          1.900393
    min        -39.006941
    25%        -38.442905
    50%        -37.707244
    75%        -37.094433
    max         38.986998
    Name: Origin Latitude, dtype: float64
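This summary shows the reverse of the longitude pattern: nearly all latitudes are negative (about -37) and the outlier is the positive maximum, so an absolute-value fix would flip every latitude to positive instead of repairing the few wrong ones. A sketch of a sign-aware alternative follows, on invented values. `fix_sign` is a hypothetical helper, and it assumes that the majority sign of a column is the correct one.

```python
import numpy as np
import pandas as pd

def fix_sign(col):
    """Force every value in `col` to carry the sign of the column's median."""
    sign = np.sign(col.median())
    return col.abs() * sign

toy = pd.DataFrame({
    'Origin Latitude': [-37.09, -38.48, 38.99, -38.14],
    'Origin Longitude': [144.43, 146.01, -148.34, 143.80],
})

# Latitudes end up negative and longitudes positive, matching the bulk of each column.
for name in ['Origin Latitude', 'Origin Longitude']:
    toy[name] = fix_sign(toy[name])
```

Because the nearest-neighbour comparison used later in the notebook depends only on coordinate differences, it gives the same result under either sign convention as long as each column is internally consistent.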
    
    data_1['Origin Latitude'] = data_1['Origin Latitude'].apply(fix_Longitude_Latitude)
    
    
    /anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
    A value is trying to be set on a copy of a slice from a DataFrame.
    Try using .loc[row_indexer,col_indexer] = value instead
    
    See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
      """Entry point for launching an IPython kernel.
    

    Task 14: Fix the erroneous values in the 'Destination Latitude' and 'Destination Longitude' columns

    data_1['Destination Latitude'] = data_1['Destination Latitude'].apply(fix_Longitude_Latitude)
    
    
    /anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
    A value is trying to be set on a copy of a slice from a DataFrame.
    Try using .loc[row_indexer,col_indexer] = value instead
    
    See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
      """Entry point for launching an IPython kernel.
    
    data_1['Destination Longitude'] = data_1['Destination Longitude'].apply(fix_Longitude_Latitude)
    
    
    /anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
    A value is trying to be set on a copy of a slice from a DataFrame.
    Try using .loc[row_indexer,col_indexer] = value instead
    
    See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
      """Entry point for launching an IPython kernel.
    

    Task 15: Fill the missing values in 'Origin Region' and 'Destination Region'

    data_1[data_1['Origin Region'].isnull()]
    
    
    Id Drone Type Post Type Package Weight Origin Region Destination Region Origin Latitude Origin Longitude Destination Latitude Destination Longitude Journey Distance Departure Date Departure Time Travel Time Delivery Time Delivery Fare
    2080 ID1634835235 1.0 0 31.871 NaN 22.0 37.300712 143.610855 36.794684 144.280083 81.904 2018-06-09 03:21:49 86.04 4:47:51 95.70
    4442 ID1321659107 2.0 0 12.991 NaN 10.0 38.180248 146.775985 37.396662 147.313624 99.224 2018-04-12 11:25:54 95.42 13:01:19 98.48
    8220 ID1522212019 1.0 0 25.282 NaN 23.0 36.756457 148.437792 37.665363 145.818608 253.275 2018-03-27 17:56:31 251.88 22:08:23 94.56
    8228 ID1946204979 2.0 0 12.779 NaN 25.0 37.391203 148.116452 36.520786 147.557258 108.914 2018-07-08 10:57:12 104.10 12:41:17 102.49
    15023 ID1891262263 1.0 0 36.466 NaN 7.0 37.537619 145.202427 38.723869 143.685908 187.265 2018-03-13 07:21:08 188.00 10:29:08 72.78
    20513 ID1490968406 1.0 0 42.648 NaN 9.0 36.654443 143.691689 38.849982 145.479151 290.643 2018-02-16 18:46:02 288.04 23:34:04 99.81
    20515 ID5148310393 1.0 1 44.505 NaN 7.0 38.973121 142.989422 38.823172 143.990199 88.293 2018-04-18 23:35:27 92.22 1:07:40 119.24
    28710 ID5941350307 1.0 1 17.232 NaN 32.0 37.616289 146.052853 36.696768 148.280560 222.563 2018-02-22 02:17:46 222.16 5:59:55 125.46
    34853 ID5234294750 2.0 1 29.063 NaN 7.0 37.048542 142.843945 38.852320 143.963431 223.542 2018-07-14 13:22:02 206.75 16:48:47 151.47
    36904 ID1724943602 1.0 0 31.774 NaN 31.0 36.736858 143.688416 37.548149 143.885770 91.993 2018-02-12 14:31:24 95.80 16:07:11 81.57
    data_1.loc[data_1['Origin Region'].notnull()]
    
    
    Id Drone Type Post Type Package Weight Origin Region Destination Region Origin Latitude Origin Longitude Destination Latitude Destination Longitude Journey Distance Departure Date Departure Time Travel Time Delivery Time Delivery Fare
    0 ID1645282128 2.0 0 21.686 19.0 38.0 37.089338 144.429529 37.639134 142.891391 149.212 2018-01-16 09:38:17 140.19 11:58:28 99.25
    1 ID1697620764 NaN 0 39.075 15.0 15.0 38.481935 146.009567 38.585528 146.199827 20.185 2018-02-10 04:28:17 22.84 4:51:07 149.04
    2 ID1543933503 2.0 0 7.243 33.0 28.0 38.754167 144.509664 38.242224 147.855342 296.975 2018-05-05 01:38:03 272.52 6:10:34 141.48
    3 ID1756517608 2.0 0 13.383 10.0 38.0 37.240526 147.568019 37.687178 142.991188 407.396 2018-06-11 11:43:04 371.40 17:54:27 122.82
    4 ID1832325834 2.0 0 8.123 1.0 8.0 38.143985 143.798292 38.548315 144.769228 95.974 2018-03-16 14:50:25 92.51 16:22:55 111.97
    ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
    37839 ID1879423081 2.0 0 17.081 20.0 14.0 38.860358 148.174855 37.562224 146.443991 209.278 2018-02-15 12:09:15 193.98 15:23:13 121.33
    37840 ID5705840841 1.0 1 35.164 29.0 23.0 37.250331 142.837244 37.739285 145.963420 281.403 2018-01-01 00:07:35 279.10 4:46:41 136.58
    37841 ID1276239209 3.0 0 36.704 12.0 11.0 36.578568 145.273744 38.305557 146.997297 245.269 2018-01-24 06:48:05 200.54 10:08:37 146.69
    37842 ID1432868583 3.0 0 13.195 37.0 21.0 38.755802 147.744770 37.487866 143.585293 390.599 2018-04-07 18:57:52 315.28 0:13:08 170.52
    37889 NaN 1.0 0 34.130 8.0 36.0 38.434373 144.730220 37.047801 145.309517 162.554 2018-02-06 06:27:55 164.08 9:11:59 69.44

    37834 rows × 16 columns

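    # Rows with a known 'Origin Region' serve as the reference set
    data_2 = data_1.loc[data_1['Origin Region'].notnull()]
    def fill_region(df):
        # Return the region of the reference row whose origin is closest to this row's origin
        longitude,latitude = df['Origin Longitude'],df['Origin Latitude']
        distance_0 = 100  # smallest squared distance found so far (degrees squared)
        region_out = ''
    
        for index in data_2.index:
            longitude_ = data_2.loc[index,'Origin Longitude']
            latitude_ = data_2.loc[index,'Origin Latitude']
            region = data_2.loc[index,'Origin Region']
            
            # squared Euclidean distance in coordinate space
            distance = (longitude-longitude_)**2+(latitude_-latitude)**2
            if distance<distance_0 :
                region_out = region
                distance_0 = distance
        return region_out
    
    
    # Preview only: the result is displayed but not assigned back to data_1
    data_1.loc[data_1['Origin Region'].isnull()].apply(fill_region,axis=1)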
    
    
    2080     21.0
    4442     11.0
    8220     32.0
    8228      2.0
    15023     5.0
    20513    16.0
    20515    24.0
    28710    23.0
    34853    29.0
    36904    16.0
    dtype: float64
    
    data_1.loc[data_1['Origin Region'].isnull()]
    
    
    Id Drone Type Post Type Package Weight Origin Region Destination Region Origin Latitude Origin Longitude Destination Latitude Destination Longitude Journey Distance Departure Date Departure Time Travel Time Delivery Time Delivery Fare
    2080 ID1634835235 1.0 0 31.871 NaN 22.0 37.300712 143.610855 36.794684 144.280083 81.904 2018-06-09 03:21:49 86.04 4:47:51 95.70
    4442 ID1321659107 2.0 0 12.991 NaN 10.0 38.180248 146.775985 37.396662 147.313624 99.224 2018-04-12 11:25:54 95.42 13:01:19 98.48
    8220 ID1522212019 1.0 0 25.282 NaN 23.0 36.756457 148.437792 37.665363 145.818608 253.275 2018-03-27 17:56:31 251.88 22:08:23 94.56
    8228 ID1946204979 2.0 0 12.779 NaN 25.0 37.391203 148.116452 36.520786 147.557258 108.914 2018-07-08 10:57:12 104.10 12:41:17 102.49
    15023 ID1891262263 1.0 0 36.466 NaN 7.0 37.537619 145.202427 38.723869 143.685908 187.265 2018-03-13 07:21:08 188.00 10:29:08 72.78
    20513 ID1490968406 1.0 0 42.648 NaN 9.0 36.654443 143.691689 38.849982 145.479151 290.643 2018-02-16 18:46:02 288.04 23:34:04 99.81
    20515 ID5148310393 1.0 1 44.505 NaN 7.0 38.973121 142.989422 38.823172 143.990199 88.293 2018-04-18 23:35:27 92.22 1:07:40 119.24
    28710 ID5941350307 1.0 1 17.232 NaN 32.0 37.616289 146.052853 36.696768 148.280560 222.563 2018-02-22 02:17:46 222.16 5:59:55 125.46
    34853 ID5234294750 2.0 1 29.063 NaN 7.0 37.048542 142.843945 38.852320 143.963431 223.542 2018-07-14 13:22:02 206.75 16:48:47 151.47
    36904 ID1724943602 1.0 0 31.774 NaN 31.0 36.736858 143.688416 37.548149 143.885770 91.993 2018-02-12 14:31:24 95.80 16:07:11 81.57
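The output above shows that 'Origin Region' is still empty for these rows, because the result of `apply` was not assigned back, and 'Destination Region' has not been handled. A sketch of a vectorised nearest-neighbour fill that writes its result into the frame and works for either prefix follows. The helper `fill_region_nearest` and the toy coordinates are illustrative assumptions. Like the loop above, it compares squared differences in degrees rather than a true geographic distance.

```python
import numpy as np
import pandas as pd

def fill_region_nearest(df, prefix):
    """Fill '<prefix> Region' NaNs with the region of the nearest labelled point."""
    region, lat, lon = (f'{prefix} Region', f'{prefix} Latitude',
                        f'{prefix} Longitude')
    known = df[df[region].notnull()]
    missing = df[df[region].isnull()]
    if known.empty or missing.empty:
        return df
    # Pairwise squared distances: one row per missing point, one column per known point.
    d2 = ((missing[[lat]].to_numpy() - known[lat].to_numpy()) ** 2
          + (missing[[lon]].to_numpy() - known[lon].to_numpy()) ** 2)
    nearest = d2.argmin(axis=1)
    # Write the region of the closest known point back into the frame.
    df.loc[missing.index, region] = known[region].to_numpy()[nearest]
    return df

toy = pd.DataFrame({
    'Origin Region': [19.0, 15.0, np.nan],
    'Origin Latitude': [37.09, 38.48, 38.40],
    'Origin Longitude': [144.43, 146.01, 146.10],
})
toy = fill_region_nearest(toy, 'Origin')
```

Calling the same helper with `'Destination'` would cover the second column, provided the frame has the matching 'Destination Region', 'Destination Latitude', and 'Destination Longitude' columns.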

    Task 16: Find the erroneous values in 'Departure Date'

    data_1['Departure Date'].sort_values(ascending=False)
    
    
    31975    2018-30-06
    4990     2018-28-06
    17911    2018-28-06
    28350    2018-28-05
    35740    2018-28-05
                ...    
    12702    2018-01-01
    36755    2018-01-01
    12684    2018-01-01
    36782    2018-01-01
    27230    2018-01-01
    Name: Departure Date, Length: 37844, dtype: object
    
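    def fix_date(dt):
        split_ = dt.split('-')  # stored as 'YYYY-MM-DD'; some rows have day and month swapped
        year,month,day = split_[0],split_[1],split_[2]
        if month > '08':  # the data ends before August 2018, so a larger "month" is really a day
            return year+'-'+day+'-'+month
        else:
            return dt
    data_1['Departure Date'] = data_1['Departure Date'].apply(fix_date)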
    
    /anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:8: SettingWithCopyWarning: 
    A value is trying to be set on a copy of a slice from a DataFrame.
    Try using .loc[row_indexer,col_indexer] = value instead
    
    See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
    
    data_1['Departure Date'].sort_values(ascending=False)
    
    17849    2018-07-28
    17570    2018-07-28
    988      2018-07-28
    592      2018-07-28
    12772    2018-07-28
                ...    
    29092    2018-01-01
    5232     2018-01-01
    12584    2018-01-01
    29041    2018-01-01
    5519     2018-01-01
    Name: Departure Date, Length: 37844, dtype: object
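The string comparison in `fix_date` relies on the data ending before August 2018. Parsing gives an independent check: `pd.to_datetime` with `errors='coerce'` turns any string that is not a valid `%Y-%m-%d` date into `NaT`, which exposes swapped day and month fields as well as impossible dates. The sketch below runs on invented strings; the example values and the names `dates`, `parsed`, `swapped`, and `still_bad` are assumptions for illustration.

```python
import pandas as pd

dates = pd.Series(['2018-01-16', '2018-30-06', '2018-28-05', '2018-02-30'])

# Strict parse; anything that is not a real year-month-day date becomes NaT.
parsed = pd.to_datetime(dates, format='%Y-%m-%d', errors='coerce')
bad = parsed.isnull()

# Retry the failures as year-day-month.
swapped = pd.to_datetime(dates[bad], format='%Y-%d-%m', errors='coerce')
parsed.loc[bad] = swapped

# Whatever is still NaT is not a valid date in either layout.
still_bad = dates[parsed.isnull()]
```

A swap where both fields are 12 or less, such as '2018-05-03', is a valid date in both layouts, so parsing alone cannot detect it.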
    

    Task 17: Write the dataset to 'solution.csv' in the current directory

    data_1.to_csv('solution.csv')
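By default `to_csv` also writes the row index as an unnamed first column, which comes back as an extra column when the file is read again. If that is not wanted, `index=False` omits it. A small sketch using an in-memory buffer in place of 'solution.csv' (the toy frame and the names `buffer` and `round_trip` are assumptions):

```python
import io

import pandas as pd

toy = pd.DataFrame({'Id': ['ID1645282128', 'ID1697620764'],
                    'Delivery Fare': [99.25, 149.04]},
                   index=[0, 37889])

buffer = io.StringIO()
toy.to_csv(buffer, index=False)   # the same call accepts a file name such as 'solution.csv'
buffer.seek(0)

# Reading the data back yields exactly the original columns.
round_trip = pd.read_csv(buffer)
```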
    
  • Original article: https://www.cnblogs.com/Yang-Sen/p/11997699.html