    Data Analysis Day 03

    4. Data Cleaning with pandas

    Handling Missing Data

    • There are two kinds of missing values:
      • None
      • np.nan (NaN)

    In [1]:

    import numpy as np
    import pandas as pd
    from pandas import Series,DataFrame
    import tushare as ts  # financial data API package
    import matplotlib.pyplot as plt
    
    • The difference between the two kinds of missing values

    In [2]:

    type(np.nan)
    

    Out[2]:

    float
    

    In [5]:

    np.nan + 3
    

    Out[5]:

    nan
    

    In [3]:

    type(None)
    

    Out[3]:

    NoneType
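
    Unlike np.nan, None does not support arithmetic, which is one reason pandas stores missing values in numeric columns as NaN. A quick check:

    In [ ]:

    None + 3  # raises TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'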
    
    • None and NaN in pandas: when None is assigned into a numeric column, pandas coerces it to NaN and the column dtype becomes float

    In [10]:

    df = DataFrame(data=np.random.randint(0,100,size=(8,5)))
    df
    

    Out[10]:

    0 1 2 3 4
    0 44 91 92 51 55
    1 23 22 92 35 83
    2 21 52 40 63 29
    3 94 51 24 70 59
    4 27 78 1 21 17
    5 94 57 5 43 22
    6 87 31 58 30 82
    7 93 28 54 7 93

    In [12]:

    df.iloc[1,2] = None
    df.iloc[3,4] = None
    df.iloc[4,1] = None
    df.iloc[7,4] = np.nan
    
    

    In [13]:

    df
    
    

    Out[13]:

    0 1 2 3 4
    0 44 91.0 92.0 51 55.0
    1 23 22.0 NaN 35 83.0
    2 21 52.0 40.0 63 29.0
    3 94 51.0 24.0 70 NaN
    4 27 NaN 1.0 21 17.0
    5 94 57.0 5.0 43 22.0
    6 87 31.0 58.0 30 82.0
    7 93 28.0 54.0 7 NaN

    pandas operations for handling null values

    • isnull
    • notnull
    • any
    • all
    • dropna
    • fillna

    In [16]:

    df.isnull()
    
    

    Out[16]:

    0 1 2 3 4
    0 False False False False False
    1 False False True False False
    2 False False False False False
    3 False False False False True
    4 False True False False False
    5 False False False False False
    6 False False False False False
    7 False False False False True
    • Detect which rows of the original data contain null values

    In [20]:

    df.isnull().any(axis=1)
    
    

    Out[20]:

    0    False
    1     True
    2    False
    3     True
    4     True
    5    False
    6    False
    7     True
    dtype: bool
    • any and all can detect which rows or columns of a df contain null values
    • isnull -> any(axis=1)
    • notnull -> all(axis=1)

    In [24]:

    ~df.isnull().any(axis=1)
    df.loc[~df.isnull().any(axis=1)]
    
    

    Out[24]:

    0 1 2 3 4
    0 44 91.0 92.0 51 55.0
    2 21 52.0 40.0 63 29.0
    5 94 57.0 5.0 43 22.0
    6 87 31.0 58.0 30 82.0

    In [28]:

    df.notnull().all(axis=1)
    df.loc[df.notnull().all(axis=1)]
    
    

    Out[28]:

    0 1 2 3 4
    0 44 91.0 92.0 51 55.0
    2 21 52.0 40.0 63 29.0
    5 94 57.0 5.0 43 22.0
    6 87 31.0 58.0 30 82.0

    In [29]:

    df.dropna(axis=0)  # drop the rows that contain null values
    
    

    Out[29]:

    0 1 2 3 4
    0 44 91.0 92.0 51 55.0
    2 21 52.0 40.0 63 29.0
    5 94 57.0 5.0 43 22.0
    6 87 31.0 58.0 30 82.0

    In [32]:

    df
    
    

    Out[32]:

    0 1 2 3 4
    0 44 91.0 92.0 51 55.0
    1 23 22.0 NaN 35 83.0
    2 21 52.0 40.0 63 29.0
    3 94 51.0 24.0 70 NaN
    4 27 NaN 1.0 21 17.0
    5 94 57.0 5.0 43 22.0
    6 87 31.0 58.0 30 82.0
    7 93 28.0 54.0 7 NaN

    In [33]:

    # fillna overwrites null values
    df.fillna(method='bfill',axis=0)  # fill each null with the next valid value along the rows (backfill)
    
    

    Out[33]:

    0 1 2 3 4
    0 44 91.0 92.0 51 55.0
    1 23 22.0 92.0 35 83.0
    2 21 52.0 40.0 63 29.0
    3 94 51.0 24.0 70 29.0
    4 27 51.0 1.0 21 17.0
    5 94 57.0 5.0 43 22.0
    6 87 31.0 58.0 30 82.0
    7 93 28.0 54.0 7 82.0
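
    Besides neighbor filling, another common approach (not shown in the original) is to fill each column's nulls with that column's mean; a sketch:

    In [ ]:

    # fill nulls column-by-column with the column mean
    df.fillna(value=df.mean())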

    Interview Question

    • Data description:
      • The data is temperature readings from one cold-storage warehouse; columns 1-7 correspond to 7 temperature sensors, sampled once per minute.
    • Processing goal:
      • Using the 4 required sensors (1-4), build a temperature-field model of the warehouse and estimate the readings for sensors 5-7.
      • In the end each warehouse only needs 4 sensors installed instead of 7.
      • f(1-4) --> y(5-7)
    • Processing steps:
      • 1. The raw data has dropped frames and needs preprocessing;
      • 2. Plot with matplotlib;
      • 3. Build a regression model (the prompt says logistic regression; see the sketch after the cells below).
    • There is no standard answer; work from your own understanding and briefly describe your process in words.
    • The test data is testData.xlsx

    In [5]:

    df = pd.read_excel('./testData.xlsx')
    df.drop(labels=['none','none1'],axis=1,inplace=True)
    
    

    In [7]:

    df.dropna(axis=0)
    
    

    . . .

    In [10]:

    #isnull notnull any all
    df.notnull().all(axis=1)
    df.loc[df.notnull().all(axis=1)]
    
    

    . . .

    In [15]:

    df.fillna(method='ffill',axis=0).fillna(method='bfill',axis=0)
    
    

    . . .
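
    The modeling step is not shown in the original cells. A minimal sketch of step 3, assuming the cleaned frame's first four columns are sensors 1-4 and the next three are sensors 5-7, and substituting scikit-learn's LinearRegression for the logistic regression named in the prompt (the targets are continuous temperatures, so an ordinary regression is the better fit):

    In [ ]:

    # minimal sketch: fit f(sensors 1-4) -> y(sensors 5-7)
    from sklearn.linear_model import LinearRegression
    
    data = df.fillna(method='ffill',axis=0).fillna(method='bfill',axis=0)
    X = data.iloc[:, 0:4]  # sensors 1-4 (assumed column layout)
    y = data.iloc[:, 4:7]  # sensors 5-7 (assumed column layout)
    
    model = LinearRegression()
    model.fit(X, y)  # LinearRegression handles multi-output targets
    pred = model.predict(X)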


    Handling Duplicate Data

    In [20]:

    df = DataFrame(data=np.random.randint(0,100,size=(8,5)))
    df.iloc[1] = [6,6,6,6,6]
    df.iloc[3] = [6,6,6,6,6]
    df.iloc[5] = [6,6,6,6,6]
    df
    
    

    Out[20]:

    0 1 2 3 4
    0 44 68 53 32 24
    1 6 6 6 6 6
    2 79 86 73 14 25
    3 6 6 6 6 6
    4 72 1 73 67 89
    5 6 6 6 6 6
    6 69 32 94 91 18
    7 47 7 77 11 67

    In [25]:

    df.drop_duplicates(keep='first')
    
    

    Out[25]:

    0 1 2 3 4
    0 44 68 53 32 24
    1 6 6 6 6 6
    2 79 86 73 14 25
    4 72 1 73 67 89
    6 69 32 94 91 18
    7 47 7 77 11 67

    Handling Outliers

    • Create a data source of 1000 rows and 3 columns (A, B, C) with values in the range 0-1, then clean out the rows where the value in column C is greater than twice that column's standard deviation

    In [27]:

    df = DataFrame(data=np.random.random(size=(1000,3)),columns=['A','B','C'])
    df.head()
    
    

    Out[27]:

    A B C
    0 0.886005 0.831529 0.822956
    1 0.391742 0.104039 0.434260
    2 0.539745 0.950540 0.948526
    3 0.392029 0.904831 0.680343
    4 0.513764 0.979957 0.600518

    In [30]:

    # define the condition that flags a value as an outlier
    twice_std = df['C'].std() * 2
    twice_std
    
    

    Out[30]:

    0.570731429850527
    
    

    In [34]:

    # find which values in column C are outliers
    df['C'] > twice_std
    df.loc[df['C'] > twice_std]
    indexs = df.loc[df['C'] > twice_std].index  # row indexes of the outliers
    
    

    In [35]:

    df.drop(labels=indexs)
    
    

    Out[35]:

    A B C
    1 0.391742 0.104039 0.434260
    5 0.512951 0.897948 0.245320
    6 0.473968 0.979213 0.271424
    7 0.843319 0.038657 0.184559
    8 0.982931 0.881284 0.208147
    10 0.784656 0.314898 0.089802
    13 0.624428 0.252411 0.327818
    15 0.213042 0.969693 0.567275
    18 0.710010 0.534330 0.559441
    19 0.026479 0.736259 0.120431
    20 0.990039 0.982449 0.017151
    23 0.158157 0.183678 0.428155
    25 0.604838 0.950466 0.294742
    26 0.304136 0.822809 0.388579
    28 0.671559 0.726631 0.196907
    30 0.811249 0.751182 0.467697
    31 0.376243 0.805516 0.287484
    33 0.570442 0.797945 0.026182
    35 0.467125 0.062123 0.439725
    36 0.861741 0.413997 0.543973
    38 0.955328 0.817003 0.293787
    47 0.458014 0.228608 0.285172
    49 0.931513 0.403981 0.239329
    51 0.008178 0.484172 0.021373
    53 0.253882 0.300069 0.561118
    55 0.752559 0.685649 0.451692
    56 0.003363 0.486893 0.154598
    57 0.859653 0.569252 0.007432
    58 0.327716 0.419704 0.452710
    59 0.068403 0.029346 0.226587
    ... ... ... ...
    953 0.247954 0.072558 0.038834
    954 0.199553 0.193049 0.027725
    956 0.513195 0.175896 0.254432
    957 0.080261 0.476756 0.521142
    958 0.944795 0.550317 0.336043
    961 0.464895 0.592027 0.195383
    962 0.127469 0.300982 0.309427
    963 0.595242 0.139702 0.450026
    964 0.520342 0.639537 0.209403
    965 0.372687 0.117984 0.262849
    966 0.007270 0.044250 0.533105
    967 0.854830 0.512720 0.173844
    968 0.247666 0.972284 0.227422
    970 0.047074 0.714412 0.392280
    974 0.112649 0.483324 0.125105
    975 0.307405 0.875641 0.432340
    978 0.520662 0.003040 0.412422
    979 0.337178 0.540283 0.257443
    981 0.877978 0.842195 0.448030
    982 0.273752 0.063285 0.291012
    985 0.765849 0.974933 0.253099
    988 0.139305 0.570496 0.535778
    989 0.597190 0.973190 0.177517
    990 0.817945 0.183825 0.330112
    991 0.738457 0.578425 0.032489
    992 0.159229 0.544980 0.242586
    994 0.300998 0.352331 0.434336
    996 0.609123 0.491735 0.045738
    998 0.839935 0.181189 0.121180
    999 0.798840 0.939869 0.150332

    5. DataFrame Concatenation and Merging

    Concatenation

    • pd.concat, pd.append

    pandas provides the pd.concat function, which is similar to np.concatenate but takes more parameters:

    objs
    axis=0
    keys
    join='outer' / 'inner': the concatenation mode. outer concatenates all items (matching or not), while inner only concatenates the matching items
    ignore_index=False
    
    
    • Matching concatenation

    In [1]:

    import numpy as np
    import pandas as pd
    from pandas import DataFrame
    
    

    In [2]:

    df1 = DataFrame({'employee':['Bob','Jake','Lisa'],
                    'group':['Accounting','Engineering','Engineering'],
                    })
    df1
    
    
    

    Out[2]:

    employee group
    0 Bob Accounting
    1 Jake Engineering
    2 Lisa Engineering

    In [6]:

    pd.concat((df1,df1,df1),axis=1)
    
    

    Out[6]:

    employee group employee group employee group
    0 Bob Accounting Bob Accounting Bob Accounting
    1 Jake Engineering Jake Engineering Jake Engineering
    2 Lisa Engineering Lisa Engineering Lisa Engineering
    • Mismatched concatenation
      • A mismatch means the indexes along the concatenation dimension differ, e.g. the column indexes differ when concatenating vertically, or the row indexes differ when concatenating horizontally
      • There are 2 join modes:
        • Outer join: fill with NaN (the default)
        • Inner join: only join the matching items

    In [7]:

    df1
    
    

    Out[7]:

    employee group
    0 Bob Accounting
    1 Jake Engineering
    2 Lisa Engineering

    In [12]:

    df2 = df1.copy()
    df2.columns = ['employee','groups']
    df2
    
    

    Out[12]:

    employee groups
    0 Bob Accounting
    1 Jake Engineering
    2 Lisa Engineering

    In [14]:

    pd.concat((df1,df2),axis=0)
    
    

    Out[14]:

    employee group groups
    0 Bob Accounting NaN
    1 Jake Engineering NaN
    2 Lisa Engineering NaN
    0 Bob NaN Accounting
    1 Jake NaN Engineering
    2 Lisa NaN Engineering

    In [16]:

    pd.concat((df1,df2),axis=0,join='inner')
    
    

    Out[16]:

    employee
    0 Bob
    1 Jake
    2 Lisa
    0 Bob
    1 Jake
    2 Lisa
    • Using the append function

    In [17]:

    df1.append(df2)
    
    

    Out[17]:

    employee group groups
    0 Bob Accounting NaN
    1 Jake Engineering NaN
    2 Lisa Engineering NaN
    0 Bob NaN Accounting
    1 Jake NaN Engineering
    2 Lisa NaN Engineering
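
    Note: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; in current pandas the same result is produced with pd.concat:

    In [ ]:

    pd.concat((df1,df2))  # equivalent to df1.append(df2)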

    Merging: concatenation splices tables together, whereas merging combines tables based on the data inside them

    • merge differs from concat in that merge joins tables on a shared column
    • When merging with pd.merge(), the column whose name appears in both tables is automatically used as the key
    • The order of elements within each column does not need to match

    One-to-one merge

    In [18]:

    df1 = DataFrame({'employee':['Bob','Jake','Lisa'],
                    'group':['Accounting','Engineering','Engineering'],
                    })
    df1
    
    

    Out[18]:

    employee group
    0 Bob Accounting
    1 Jake Engineering
    2 Lisa Engineering

    In [19]:

    df2 = DataFrame({'employee':['Lisa','Bob','Jake'],
                    'hire_date':[2004,2008,2012],
                    })
    df2
    
    

    Out[19]:

    employee hire_date
    0 Lisa 2004
    1 Bob 2008
    2 Jake 2012

    In [20]:

    pd.merge(df1,df2)
    
    

    Out[20]:

    employee group hire_date
    0 Bob Accounting 2008
    1 Jake Engineering 2012
    2 Lisa Engineering 2004

    One-to-many merge

    In [21]:

    df3 = DataFrame({
        'employee':['Lisa','Jake'],
        'group':['Accounting','Engineering'],
        'hire_date':[2004,2016]})
    df3
    
    

    Out[21]:

    employee group hire_date
    0 Lisa Accounting 2004
    1 Jake Engineering 2016

    In [22]:

    df4 = DataFrame({'group':['Accounting','Engineering','Engineering'],
                           'supervisor':['Carly','Guido','Steve']
                    })
    df4
    
    

    Out[22]:

    group supervisor
    0 Accounting Carly
    1 Engineering Guido
    2 Engineering Steve

    In [23]:

    pd.merge(df3,df4)
    
    

    Out[23]:

    employee group hire_date supervisor
    0 Lisa Accounting 2004 Carly
    1 Jake Engineering 2016 Guido
    2 Jake Engineering 2016 Steve

    Many-to-many merge

    In [24]:

    df1 = DataFrame({'employee':['Bob','Jake','Lisa'],
                     'group':['Accounting','Engineering','Engineering']})
    df1
    
    

    Out[24]:

    employee group
    0 Bob Accounting
    1 Jake Engineering
    2 Lisa Engineering

    In [25]:

    df5 = DataFrame({'group':['Engineering','Engineering','HR'],
                    'supervisor':['Carly','Guido','Steve']
                    })
    df5
    
    

    Out[25]:

    group supervisor
    0 Engineering Carly
    1 Engineering Guido
    2 HR Steve

    In [28]:

    pd.merge(df1,df5)
    
    

    Out[28]:

    employee group supervisor
    0 Jake Engineering Carly
    1 Jake Engineering Guido
    2 Lisa Engineering Carly
    3 Lisa Engineering Guido

    Normalizing the key

    • When columns conflict, i.e. several columns share the same name, use on= to specify which column serves as the key, together with suffixes to rename the conflicting columns (see the sketch after the output below)

    In [29]:

    df1 = DataFrame({'employee':['Jack',"Summer","Steve"],
                     'group':['Accounting','Finance','Marketing']})
    df1
    
    

    Out[29]:

    employee group
    0 Jack Accounting
    1 Summer Finance
    2 Steve Marketing

    In [30]:

    df2 = DataFrame({'employee':['Jack','Bob',"Jake"],
                     'hire_date':[2003,2009,2012],
                    'group':['Accounting','sell','ceo']})
    df2
    
    

    Out[30]:

    employee group hire_date
    0 Jack Accounting 2003
    1 Bob sell 2009
    2 Jake ceo 2012

    In [32]:

    pd.merge(df1,df2,on='group')
    
    

    Out[32]:

    employee_x group employee_y hire_date
    0 Jack Accounting Jack 2003
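
    The suffixes parameter mentioned above controls how the conflicting column names are renamed (the default is ('_x', '_y')); a sketch with custom suffixes:

    In [ ]:

    pd.merge(df1,df2,on='group',suffixes=('_left','_right'))
    # the conflicting employee columns become employee_left and employee_right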
    • When the two tables have no directly joinable column, use left_on and right_on to specify which column on each side serves as the join key

    In [33]:

    df1 = DataFrame({'employee':['Bobs','Linda','Bill'],
                    'group':['Accounting','Product','Marketing'],
                   'hire_date':[1998,2017,2018]})
    df1
    
    

    Out[33]:

    employee group hire_date
    0 Bobs Accounting 1998
    1 Linda Product 2017
    2 Bill Marketing 2018

    In [34]:

    df5 = DataFrame({'name':['Lisa','Bobs','Bill'],
                    'hire_dates':[1998,2016,2007]})
    df5
    
    

    Out[34]:

    hire_dates name
    0 1998 Lisa
    1 2016 Bobs
    2 2007 Bill

    In [35]:

    pd.merge(df1,df5,left_on='employee',right_on='name')
    
    

    Out[35]:

    employee group hire_date hire_dates name
    0 Bobs Accounting 1998 2016 Bobs
    1 Bill Marketing 2018 2007 Bill

    Inner and outer merges: outer takes the union, inner takes the intersection

    In [37]:

    df6 = DataFrame({'name':['Peter','Paul','Mary'],
                   'food':['fish','beans','bread']}
                   )
    df7 = DataFrame({'name':['Mary','Joseph'],
                    'drink':['wine','beer']})
    
    
    

    In [38]:

    df6
    
    

    Out[38]:

    food name
    0 fish Peter
    1 beans Paul
    2 bread Mary

    In [39]:

    df7
    
    

    Out[39]:

    drink name
    0 wine Mary
    1 beer Joseph

    In [43]:

    pd.merge(df6,df7,how='right')
    
    

    Out[43]:

    food name drink
    0 bread Mary wine
    1 NaN Joseph beer
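
    For comparison, an outer merge keeps the union of keys from both tables and fills the gaps with NaN; a sketch:

    In [ ]:

    pd.merge(df6,df7,how='outer')
    # keeps Peter, Paul, Mary and Joseph; missing food/drink values become NaN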

    In [ ]:

    # exercise: merge df1 and df2 (a solution sketch follows this cell)
    dic1={
        
        'name':['tom','jay','helly'],
        'age':[11,12,33],
        'classRoom':[1,2,3]
    }
    df1=DataFrame(data=dic1)
    df2=DataFrame(data=np.random.randint(60,100,size=(3,3)),
                  index=['jay','tom','helly'],
                 columns=['java','python','c'])
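
    The exercise leaves the merge itself unwritten. A minimal sketch, assuming df1's name column should be matched against df2's row index (the student names):

    In [ ]:

    # join df1's 'name' column against df2's index
    pd.merge(df1, df2, left_on='name', right_index=True)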
    
    
    

    6. Population Analysis Case Study

    • Requirements:
      • Import the files and inspect the raw data
      • Merge the population data with the state-abbreviation data
      • Drop the duplicate abbreviation column from the merged data
      • Find which columns contain missing data
      • Find which state/region values leave state as NaN, and deduplicate them
      • Fill in the correct state value for those state/region entries, removing every NaN from the state column
      • Merge in the state area data (areas)
      • We will find that the area (sq. mi) column has missing data; find which rows
      • Drop the rows that contain missing data
      • Extract the total population data for 2010
      • Compute each state's population density
      • Sort, and find the five states with the highest population density

    In [1]:

    import pandas as pd
    from pandas import DataFrame
    import numpy as np
    
    

    In [10]:

    abb = pd.read_csv('./data/state-abbrevs.csv')  # full names and abbreviations of the states
    area = pd.read_csv('./data/state-areas.csv')  # full names and areas of the states
    pop = pd.read_csv('./data/state-population.csv')  # population data
    
    

    In [7]:

    pop.head(2)
    
    

    Out[7]:

    state/region ages year population
    0 AL under18 2012 1117489.0
    1 AL total 2012 4817528.0

    In [8]:

    abb.head(2)
    
    

    Out[8]:

    state abbreviation
    0 Alabama AL
    1 Alaska AK

    In [12]:

    # combine (merge) the data
    abb_pop = pd.merge(abb,pop,how='outer',left_on='abbreviation',right_on='state/region')
    abb_pop.head()
    
    

    Out[12]:

    state abbreviation state/region ages year population
    0 Alabama AL AL under18 2012 1117489.0
    1 Alabama AL AL total 2012 4817528.0
    2 Alabama AL AL under18 2010 1130966.0
    3 Alabama AL AL total 2010 4785570.0
    4 Alabama AL AL under18 2011 1125763.0

    In [13]:

    abb_pop.drop(labels='abbreviation',axis=1,inplace=True)
    
    

    In [15]:

    # inspect which columns contain missing data (via info())
    abb_pop.info()
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 2544 entries, 0 to 2543
    Data columns (total 5 columns):
    state           2448 non-null object
    state/region    2544 non-null object
    ages            2544 non-null object
    year            2544 non-null int64
    population      2524 non-null float64
    dtypes: float64(1), int64(1), object(3)
    memory usage: 119.2+ KB
    
    

    In [16]:

    # inspect which columns contain missing data (via isnull/any)
    abb_pop.isnull().any(axis=0)
    
    

    Out[16]:

    state            True
    state/region    False
    ages            False
    year            False
    population       True
    dtype: bool
    
    

    In [19]:

    # find which state/region values (state abbreviations) leave state (the full name) as NaN, then deduplicate the abbreviations
    abb_pop.head()
    
    

    Out[19]:

    state state/region ages year population
    0 Alabama AL under18 2012 1117489.0
    1 Alabama AL total 2012 4817528.0
    2 Alabama AL under18 2010 1130966.0
    3 Alabama AL total 2010 4785570.0
    4 Alabama AL under18 2011 1125763.0

    In [24]:

    # 1. find the nulls in the full-name column
    abb_pop['state'].isnull()
    # 2. take the rows corresponding to those nulls
    abb_pop.loc[abb_pop['state'].isnull()]
    # 3. take the abbreviations from those rows
    abb_pop.loc[abb_pop['state'].isnull()]['state/region']
    # 4. deduplicate the resulting Series of abbreviations
    abb_pop.loc[abb_pop['state'].isnull()]['state/region'].unique()  # unique() deduplicates the elements of a Series
    
    

    Out[24]:

    array(['PR', 'USA'], dtype=object)
    
    

    In [ ]:

    # fill in the correct state value for the found state/region entries, removing every NaN from the state column
    # locate the null full names belonging to the abbreviation PR and assign them PR's full name, PUERTO RICO
    
    

    In [29]:

    abb_pop['state/region'] == 'PR'
    # take the rows whose abbreviation is PR
    abb_pop.loc[abb_pop['state/region'] == 'PR']
    # assign PR's full name to the nulls in the state column of those rows
    indexs = abb_pop.loc[abb_pop['state/region'] == 'PR'].index  # row indexes of the null full names for PR
    abb_pop.loc[indexs,'state'] = 'PUERTO RICO'
    
    

    In [31]:

    # similarly, assign United States as the full name for the USA rows
    abb_pop['state/region'] == 'USA'
    abb_pop.loc[abb_pop['state/region'] == 'USA']
    indexs = abb_pop.loc[abb_pop['state/region'] == 'USA'].index
    abb_pop.loc[indexs,'state'] = 'United States'
    
    

    In [39]:

    # merge in the state area data (areas)
    abb_pop_area = pd.merge(abb_pop,area,how='outer')
    abb_pop_area.head()
    
    

    Out[39]:

    state state/region ages year population area (sq. mi)
    0 Alabama AL under18 2012.0 1117489.0 52423.0
    1 Alabama AL total 2012.0 4817528.0 52423.0
    2 Alabama AL under18 2010.0 1130966.0 52423.0
    3 Alabama AL total 2010.0 4785570.0 52423.0
    4 Alabama AL under18 2011.0 1125763.0 52423.0

    In [45]:

    # the area (sq. mi) column has missing data; find which rows
    indexs = abb_pop_area.loc[abb_pop_area['area (sq. mi)'].isnull()].index
    indexs
    
    

    . . .

    In [43]:

    abb_pop_area.drop(labels=indexs,inplace=True)
    
    

    In [48]:

    # extract the total population data for 2010 with query('condition')
    abb_pop_area.query('year==2010 & ages == "total"')
    
    

    . . .

    In [ ]:

    # compute each state's population density
    
    
    

    In [51]:

    abb_pop_area['midu'] = abb_pop_area['population'] / abb_pop_area['area (sq. mi)']  # 'midu' means population density
    abb_pop_area.head()
    
    

    Out[51]:

    state state/region ages year population area (sq. mi) midu
    0 Alabama AL under18 2012.0 1117489.0 52423.0 21.316769
    1 Alabama AL total 2012.0 4817528.0 52423.0 91.897221
    2 Alabama AL under18 2010.0 1130966.0 52423.0 21.573851
    3 Alabama AL total 2010.0 4785570.0 52423.0 91.287603
    4 Alabama AL under18 2011.0 1125763.0 52423.0 21.474601

    In [53]:

    # sort, and find the five states with the highest population density (a sketch taking the top five follows the output)
    abb_pop_area.sort_values(by='midu',axis=0,ascending=False)
    
    

    Out[53]:

    state state/region ages year population area (sq. mi) midu
    391 District of Columbia DC total 2013.0 646449.0 68.0 9506.602941
    385 District of Columbia DC total 2012.0 633427.0 68.0 9315.102941
    387 District of Columbia DC total 2011.0 619624.0 68.0 9112.117647
    431 District of Columbia DC total 1990.0 605321.0 68.0 8901.779412
    389 District of Columbia DC total 2010.0 605125.0 68.0 8898.897059
    426 District of Columbia DC total 1991.0 600870.0 68.0 8836.323529
    429 District of Columbia DC total 1992.0 597567.0 68.0 8787.750000
    422 District of Columbia DC total 1993.0 595302.0 68.0 8754.441176
    392 District of Columbia DC total 2009.0 592228.0 68.0 8709.235294
    425 District of Columbia DC total 1994.0 589240.0 68.0 8665.294118
    420 District of Columbia DC total 1995.0 580519.0 68.0 8537.044118
    396 District of Columbia DC total 2008.0 580236.0 68.0 8532.882353
    406 District of Columbia DC total 2001.0 574504.0 68.0 8448.588235
    394 District of Columbia DC total 2007.0 574404.0 68.0 8447.117647
    408 District of Columbia DC total 2002.0 573158.0 68.0 8428.794118
    419 District of Columbia DC total 1996.0 572379.0 68.0 8417.338235
    412 District of Columbia DC total 2000.0 572046.0 68.0 8412.441176
    400 District of Columbia DC total 2006.0 570681.0 68.0 8392.367647
    410 District of Columbia DC total 1999.0 570220.0 68.0 8385.588235
    402 District of Columbia DC total 2003.0 568502.0 68.0 8360.323529
    404 District of Columbia DC total 2004.0 567754.0 68.0 8349.323529
    417 District of Columbia DC total 1997.0 567739.0 68.0 8349.102941
    398 District of Columbia DC total 2005.0 567136.0 68.0 8340.235294
    415 District of Columbia DC total 1998.0 565232.0 68.0 8312.235294
    421 District of Columbia DC under18 1995.0 123620.0 68.0 1817.941176
    424 District of Columbia DC under18 1994.0 122170.0 68.0 1796.617647
    418 District of Columbia DC under18 1996.0 121210.0 68.0 1782.500000
    423 District of Columbia DC under18 1993.0 120471.0 68.0 1771.632353
    416 District of Columbia DC under18 1997.0 119531.0 68.0 1757.808824
    428 District of Columbia DC under18 1992.0 118636.0 68.0 1744.647059
    ... ... ... ... ... ... ... ...
    53 Alaska AK total 1994.0 603308.0 656425.0 0.919081
    56 Alaska AK total 1993.0 599434.0 656425.0 0.913180
    50 Alaska AK total 1992.0 588736.0 656425.0 0.896882
    55 Alaska AK total 1991.0 570193.0 656425.0 0.868634
    48 Alaska AK total 1990.0 553290.0 656425.0 0.842884
    63 Alaska AK under18 1998.0 192636.0 656425.0 0.293462
    66 Alaska AK under18 1999.0 191422.0 656425.0 0.291613
    69 Alaska AK under18 2000.0 190615.0 656425.0 0.290384
    71 Alaska AK under18 2001.0 188771.0 656425.0 0.287574
    73 Alaska AK under18 2002.0 188482.0 656425.0 0.287134
    92 Alaska AK under18 2011.0 188329.0 656425.0 0.286901
    62 Alaska AK under18 1997.0 188280.0 656425.0 0.286826
    94 Alaska AK under18 2012.0 188162.0 656425.0 0.286647
    86 Alaska AK under18 2013.0 188132.0 656425.0 0.286601
    90 Alaska AK under18 2010.0 187902.0 656425.0 0.286251
    54 Alaska AK under18 1994.0 187439.0 656425.0 0.285545
    57 Alaska AK under18 1993.0 187190.0 656425.0 0.285166
    75 Alaska AK under18 2003.0 186843.0 656425.0 0.284637
    89 Alaska AK under18 2009.0 186351.0 656425.0 0.283888
    77 Alaska AK under18 2004.0 186335.0 656425.0 0.283863
    81 Alaska AK under18 2006.0 185580.0 656425.0 0.282713
    61 Alaska AK under18 1996.0 185360.0 656425.0 0.282378
    79 Alaska AK under18 2005.0 185304.0 656425.0 0.282293
    59 Alaska AK under18 1995.0 184990.0 656425.0 0.281814
    52 Alaska AK under18 1992.0 184878.0 656425.0 0.281644
    83 Alaska AK under18 2007.0 184344.0 656425.0 0.280830
    85 Alaska AK under18 2008.0 183124.0 656425.0 0.278972
    51 Alaska AK under18 1991.0 182180.0 656425.0 0.277534
    49 Alaska AK under18 1990.0 177502.0 656425.0 0.270407
    2544 Puerto Rico NaN NaN NaN NaN 3515.0 NaN
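
    The sorted output above interleaves many rows per state (one for every year and age group), so its first five rows are all District of Columbia. Restricting to the 2010 totals first, as in the earlier query step, gives one row per state, and head(5) then yields the five densest states:

    In [ ]:

    density_2010 = abb_pop_area.query('year==2010 & ages=="total"')
    density_2010.sort_values(by='midu',ascending=False).head(5)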

    7. Advanced pandas Operations

    In [3]:

    import pandas as pd
    from pandas import DataFrame
    import numpy as np
    
    

    Replacement

    • Replacement works on both Series and DataFrame

    • Single-value replacement

      • Plain replacement: replaces every matching element: to_replace=15, value='e'
      • Column-specific single-value replacement: to_replace={column label: value to replace}, value='new value'
    • Multi-value replacement

      • List replacement: to_replace=[] value=[]
      • Dict replacement (recommended): to_replace={old: new, old: new}

    In [4]:

    df = DataFrame(data=np.random.randint(0,100,size=(6,7)))
    df
    
    

    Out[4]:

    0 1 2 3 4 5 6
    0 44 62 3 85 26 47 14
    1 15 78 32 98 85 4 51
    2 53 75 87 21 45 8 18
    3 54 31 67 49 77 25 49
    4 18 21 18 31 93 11 0
    5 21 54 76 95 70 77 49

    In [5]:

    df.replace(to_replace=3,value='Three')
    
    

    Out[5]:

    0 1 2 3 4 5 6
    0 44 62 Three 85 26 47 14
    1 15 78 32 98 85 4 51
    2 53 75 87 21 45 8 18
    3 54 31 67 49 77 25 49
    4 18 21 18 31 93 11 0
    5 21 54 76 95 70 77 49

    In [6]:

    df.replace(to_replace={3:'aaa'})
    
    

    Out[6]:

    0 1 2 3 4 5 6
    0 44 62 aaa 85 26 47 14
    1 15 78 32 98 85 4 51
    2 53 75 87 21 45 8 18
    3 54 31 67 49 77 25 49
    4 18 21 18 31 93 11 0
    5 21 54 76 95 70 77 49

    In [8]:

    # replace a value in a specific column (here: the value 77 in column 5)
    df.replace(to_replace={5:77},value='6666666')
    
    

    Out[8]:

    0 1 2 3 4 5 6
    0 44 62 3 85 26 47 14
    1 15 78 32 98 85 4 51
    2 53 75 87 21 45 8 18
    3 54 31 67 49 77 25 49
    4 18 21 18 31 93 11 0
    5 21 54 76 95 70 6666666 49

    Mapping

    • Concept: create a mapping table that binds values to specific labels or strings (giving an element an alternative representation)

    • Create a df with two columns, name and salary, then map each name to a corresponding Chinese name

    In [10]:

    dic = {
        'name':['jay','tom','jay'],
        'salary':[10000,20000,10000]
    }
    df = DataFrame(data=dic)
    df
    
    

    Out[10]:

    name salary
    0 jay 10000
    1 tom 20000
    2 jay 10000

    In [14]:

    # the mapping table
    dic = {
        'jay':'张三',
        'tom':'李四'
    }
    df['c_name'] = df['name'].map(dic)
    df
    
    

    Out[14]:

    name salary c_name
    0 jay 10000 张三
    1 tom 20000 李四
    2 jay 10000 张三

    Computation Tools

    • The portion of a salary above 3000 is taxed at 50%; compute everyone's after-tax salary (a guarded variant of the function follows the output below)

    In [16]:

    def after_sal(s):
        return s - (s-3000)*0.5
    
    

    In [18]:

    df['after_salary'] = df['salary'].map(after_sal)
    df
    
    

    Out[18]:

    name salary c_name after_salary
    0 jay 10000 张三 6500.0
    1 tom 20000 李四 11500.0
    2 jay 10000 张三 6500.0
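
    Every salary here exceeds 3000, so after_sal works as written; a slightly safer sketch guards against salaries at or below the threshold:

    In [ ]:

    def after_sal(s):
        if s <= 3000:
            return s  # no tax on income at or below 3000
        return s - (s-3000)*0.5  # 50% tax on the portion above 3000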

    Mapping Indexes

    • Use the rename() function to replace row and column index labels
    • Parameters:
      • index replaces the row index
      • columns replaces the column index

    In [19]:

    df4 = DataFrame({'color':['white','gray','purple','blue','green'],'value':np.random.randint(10,size = 5)})
    df4
    
    

    Out[19]:

    color value
    0 white 2
    1 gray 5
    2 purple 9
    3 blue 0
    4 green 1

    In [20]:

    new_index = {0:'first',1:'two',2:'three',3:'four',4:'five'}
    new_col={'color':'cc','value':'vv'}
    df4.rename(new_index,columns=new_col)
    
    

    Out[20]:

    cc vv
    first white 2
    two gray 5
    three purple 9
    four blue 0
    five green 1

    Random Sampling via Reordering

    • take()
    • np.random.permutation()

    In [22]:

    df = DataFrame(data=np.random.randint(0,100,size=(100,3)),columns=['A','B','C'])
    df
    
    

    . . .

    In [24]:

    # df.take(['B','A','C'],axis=1)  # would fail: take() accepts positional indexes, not labels
    df.take([1,0,2],axis=1)
    
    

    . . .

    In [32]:

    np.random.permutation(3)  # returns a random permutation of 0..2
    
    

    Out[32]:

    array([0, 1, 2])
    
    

    In [31]:

    # shuffle both the row order and the column order
    df.take(np.random.permutation(100),axis=0).take(np.random.permutation(3),axis=1)
    
    

    . . .

    In [35]:

    df.take(np.random.permutation(100),axis=0).take(np.random.permutation(3),axis=1)[0:50]
    
    

    . . .
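
    pandas also ships a dedicated sampling method, DataFrame.sample, which achieves the same effect more directly; a sketch:

    In [ ]:

    df.sample(n=50)    # 50 random rows
    df.sample(frac=1)  # shuffle all rows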

    Grouped Processing of Data

    • The core of grouped processing:
      • the groupby() function
      • the groups attribute shows the grouping

    In [36]:

    df = DataFrame({'item':['Apple','Banana','Orange','Banana','Orange','Apple'],
                    'price':[4,3,3,2.5,4,2],
                   'color':['red','yellow','yellow','green','green','green'],
                   'weight':[12,20,50,30,20,44]})
    df
    
    

    Out[36]:

    color item price weight
    0 red Apple 4.0 12
    1 yellow Banana 3.0 20
    2 yellow Orange 3.0 50
    3 green Banana 2.5 30
    4 green Orange 4.0 20
    5 green Apple 2.0 44

    In [37]:

    # group by fruit type
    df.groupby(by='item')
    
    

    Out[37]:

    <pandas.core.groupby.DataFrameGroupBy object at 0x0000019782507F60>
    
    

    In [38]:

    # call groups to inspect the grouping
    df.groupby(by='item').groups
    
    

    Out[38]:

    {'Apple': Int64Index([0, 5], dtype='int64'),
     'Banana': Int64Index([1, 3], dtype='int64'),
     'Orange': Int64Index([2, 4], dtype='int64')}
    
    

    In [40]:

    # compute the average price of each kind of fruit
    df.groupby(by='item').mean()['price']
    
    

    Out[40]:

    item
    Apple     3.00
    Banana    2.75
    Orange    3.50
    Name: price, dtype: float64
    
    

    In [41]:

    df.groupby(by='item')['price'].mean()  # recommended: select the column first, then aggregate
    
    

    Out[41]:

    item
    Apple     3.00
    Banana    2.75
    Orange    3.50
    Name: price, dtype: float64
    
    

    In [42]:

    # compute the average weight of fruits of each color
    df.groupby(by='color')['weight'].mean()
    
    

    Out[42]:

    color
    green     31.333333
    red       12.000000
    yellow    35.000000
    Name: weight, dtype: float64
    
    

    In [44]:

    # compute each fruit's average price and merge it back into the original data
    df
    
    

    Out[44]:

    color item price weight
    0 red Apple 4.0 12
    1 yellow Banana 3.0 20
    2 yellow Orange 3.0 50
    3 green Banana 2.5 30
    4 green Orange 4.0 20
    5 green Apple 2.0 44

    In [47]:

    series_price = df.groupby(by='item')['price'].mean() 
    dic = series_price.to_dict()
    dic  # the mapping table
    
    

    Out[47]:

    {'Apple': 3.0, 'Banana': 2.75, 'Orange': 3.5}
    
    

    In [49]:

    df['mean_price'] = df['item'].map(dic)
    df
    
    

    Out[49]:

    color item price weight mean_price
    0 red Apple 4.0 12 3.00
    1 yellow Banana 3.0 20 2.75
    2 yellow Orange 3.0 50 3.50
    3 green Banana 2.5 30 2.75
    4 green Orange 4.0 20 3.50
    5 green Apple 2.0 44 3.00

    Advanced Aggregation

    • After grouping with groupby, transform and apply can run custom functions for richer computations
    • df.groupby('item')['price'].sum() <==> df.groupby('item')['price'].apply(sum)
    • Both transform and apply run the computation; just pass a function to transform or apply
    • transform and apply also accept a lambda expression (see the example after the cells below)

    In [56]:

    def myMean(s):
        sum = 0
        for i in s:
            sum += i
        return sum/len(s)
    
    

    In [57]:

    df.groupby(by='item')['price'].apply(myMean)  # apply acts as the aggregation tool
    
    

    Out[57]:

    item
    Apple     3.00
    Banana    2.75
    Orange    3.50
    Name: price, dtype: float64
    
    

    In [58]:

    df.groupby(by='item')['price'].transform(myMean)  # transform returns the result aligned to the original rows
    
    

    Out[58]:

    0    3.00
    1    2.75
    2    3.50
    3    2.75
    4    3.50
    5    3.00
    Name: price, dtype: float64
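
    As noted above, transform and apply also accept a lambda; for example, the group mean can be written inline:

    In [ ]:

    df.groupby(by='item')['price'].transform(lambda x: x.mean())  # same result as transform(myMean)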
    
    

    Loading Data

    • Read the data in the file type-.txt

    In [1]:

    import pandas as pd
    from pandas import DataFrame,Series
    data=pd.read_csv('./data/type-.txt')
    data
    
    

    Out[1]:

    你好-我好-他也好
    0 也许-大概-有可能
    1 然而-未必-不见得

    In [2]:

    data.shape
    
    

    Out[2]:

    (2, 1)
    
    
    • Store each individual word from the file as its own element in the DataFrame

    In [4]:

    data=pd.read_csv('./data/type-.txt',sep='-',header=None)
    data
    
    

    Out[4]:

    0 1 2
    0 你好 我好 他也好
    1 也许 大概 有可能
    2 然而 未必 不见得
    • Read data from a database

    In [6]:

    # connect to the database and get a connection object
    import sqlite3
    conn=sqlite3.connect('./data/weather_2012.sqlite')
    
    

    In [7]:

    # read the values from the database table
    sql_df=pd.read_sql('select * from weather_2012',conn)
    sql_df
    
    

    . . .

    In [ ]:

    # write a DataFrame's values into the db (data_1 is assumed to be an existing DataFrame)
    data_1.to_sql('sql_data',conn)
    
    

    Pivot Tables

    • A pivot table is a table format that can dynamically rearrange and summarize data by category. Most people have used pivot tables in Excel and experienced how powerful they are; in pandas the equivalent is called pivot_table.

    • Advantages of pivot tables:

      • Highly flexible: the analysis and computation can be customized at will
      • Clear structure that makes the data easy to understand
      • Highly practical: a reporting powerhouse

    In [16]:

    import pandas as pd
    import numpy as np
    df = pd.read_csv('./data/透视表-篮球赛.csv',encoding='utf8')
    df.head()
    
    

    Out[16]:

    对手 胜负 主客场 命中 投篮数 投篮命中率 3分命中率 篮板 助攻 得分
    0 勇士 胜 客 10 23 0.435 0.444 6 11 27
    1 国王 胜 客 8 21 0.381 0.286 3 9 27
    2 小牛 胜 主 10 19 0.526 0.462 3 7 29
    3 灰熊 负 主 8 20 0.400 0.250 5 8 22
    4 76人 胜 客 10 20 0.500 0.250 3 13 27

    pivot_table has four most important parameters: index, values, columns, and aggfunc

    • index: the grouping keys for the summary
      • Every pivot_table must have an index. To see Harden's stats against each opposing team, group by opponent and compute the mean of each category:

    In [14]:

    df.pivot_table(index='对手')
    
    

    . . .

    • To see the stats against each opponent split by home and away, group by both 对手 and 主客场

    In [17]:

    df.pivot_table(index=['对手','主客场'])
    
    

    . . .

    • values: filters which columns get computed
      • If we only need Harden's points (得分), rebounds (篮板) and assists (助攻) by home/away and win/loss:

    In [18]:

    df.pivot_table(index=['主客场','胜负'],values=['得分','助攻','篮板'])
    
    

    Out[18]:

    助攻 得分 篮板
    主客场 胜负
    主 胜 10.555556 34.222222 5.444444
    主 负 8.666667 29.666667 5.000000
    客 胜 9.000000 32.000000 4.916667
    客 负 8.000000 20.000000 4.000000
    • aggfunc: sets the aggregation function applied to the data

      • When aggfunc is not set, it defaults to aggfunc='mean', computing the mean.
    • To also get James Harden's total points, total rebounds and total assists by home/away and win/loss:

    In [20]:

    df.pivot_table(index=['主客场','胜负'],values=['得分','助攻','篮板'],aggfunc=['sum','mean'])
    
    

    Out[20]:

    sum mean
    助攻 得分 篮板 助攻 得分 篮板
    主客场 胜负
    主 胜 95 308 49 10.555556 34.222222 5.444444
    主 负 26 89 15 8.666667 29.666667 5.000000
    客 胜 108 384 59 9.000000 32.000000 4.916667
    客 负 8 20 4 8.000000 20.000000 4.000000
    • columns: sets the column-level fields
      • splits the values fields into categories

    In [33]:

    # total points at home and away, summed over all opponents
    df.pivot_table(index='主客场',values='得分',aggfunc='sum',fill_value=0)
    
    

    Out[33]:

    得分
    主客场
    主 397
    客 404

    In [34]:

    # total points against each opponent, home vs. away (the previous totals further broken down by opponent)
    df.pivot_table(index='主客场',values='得分',columns='对手',aggfunc='sum',fill_value=0)
    
    

    Out[34]:

    对手 76人 勇士 国王 太阳 小牛 尼克斯 开拓者 掘金 步行者 湖人 灰熊 爵士 猛龙 篮网 老鹰 骑士 鹈鹕 黄蜂
    主客场
    主 29 0 0 0 29 37 0 21 29 0 60 56 38 37 0 35 26 0
    客 27 27 27 48 0 31 48 0 26 36 49 29 0 0 29 0 0 27

    Crosstabs

    • A special kind of pivot for computing group frequencies, summarizing the data
    • pd.crosstab(index, columns)
      • index: the data to group by, the row index of the crosstab
      • columns: the column index of the crosstab

    In [35]:

    df = DataFrame({'sex':['man','man','women','women','man','women','man','women','women'],
                   'age':[15,23,25,17,35,57,24,31,22],
                   'smoke':[True,False,False,True,True,False,False,True,False],
                   'height':[168,179,181,166,173,178,188,190,160]})
    
    
    • Count the number of smokers of each sex

    In [36]:

    pd.crosstab(df.smoke,df.sex)
    
    

    Out[36]:

    sex man women
    smoke
    False 2 3
    True 2 2
    • Show smoking status by age (a sketch aggregating a value column follows the output below)

    In [14]:

    pd.crosstab(df.age,df.smoke)
    
    

    Out[14]:

    smoke False True
    age
    15 0 1
    17 0 1
    22 1 0
    23 1 0
    24 1 0
    25 1 0
    31 0 1
    35 0 1
    57 1 0
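
    crosstab can also aggregate a value column instead of counting, via its values and aggfunc parameters; a sketch computing the average height by sex and smoking status:

    In [ ]:

    pd.crosstab(df.sex,df.smoke,values=df.height,aggfunc='mean')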