  • Pandas Cookbook -- 05 Boolean Indexing

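    Boolean Indexing

    This is based on the translation by SeanCheney on Jianshu. I adjusted some of the formatting and reorganized the table of contents to suit my own reading and to make later lookups easier.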

    import pandas as pd
    import numpy as np
    

    Set the maximum number of displayed columns and rows

    pd.set_option('display.max_columns', 5, 'display.max_rows', 5)
    

    1 Boolean Statistics

    movie = pd.read_csv('data/movie.csv', index_col='movie_title')
    
    movie.head()
    
                                                color      director_name  ...  aspect_ratio  movie_facebook_likes
    movie_title
    Avatar                                      Color      James Cameron  ...          1.78                 33000
    Pirates of the Caribbean: At World's End    Color     Gore Verbinski  ...          2.35                     0
    Spectre                                     Color         Sam Mendes  ...          2.35                 85000
    The Dark Knight Rises                       Color  Christopher Nolan  ...          2.35                164000
    Star Wars: Episode VII - The Force Awakens    NaN        Doug Walker  ...           NaN                     0

    5 rows × 27 columns

    1.1 Basic methods

    Check whether each movie runs longer than two hours

    movie_2_hours = movie['duration'] > 120
    movie_2_hours.head(10)
    
    movie_title
    Avatar                                      True
    Pirates of the Caribbean: At World's End    True
                                                ... 
    Avengers: Age of Ultron                     True
    Harry Potter and the Half-Blood Prince      True
    Name: duration, Length: 10, dtype: bool
    

    Count the movies that run longer than two hours

    movie_2_hours.sum()
    
    1039
    

    Proportion of movies that run longer than two hours

    movie_2_hours.mean()
    
    0.2113506916192026
    

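    In fact, the duration column contains missing values, and a comparison against NaN evaluates to False. To get the true proportion of movies longer than two hours, drop the missing values first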

    movie['duration'].dropna().gt(120).mean()
    
    0.21199755152009794
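
    To see why the two proportions differ, here is a minimal sketch on a made-up Series (the values are illustrative and not taken from movie.csv). It relies only on the fact that a comparison against NaN yields False.

```python
import numpy as np
import pandas as pd

# Hypothetical durations in minutes; one value is missing
duration = pd.Series([150.0, 90.0, np.nan, 130.0])

over_2h = duration > 120          # NaN > 120 evaluates to False
count = int(over_2h.sum())        # True counts as 1, False as 0
naive_share = over_2h.mean()      # denominator includes the missing row
true_share = duration.dropna().gt(120).mean()  # denominator excludes it

print(count, naive_share, true_share)
```

    The missing row is counted as "not longer than two hours" in the naive version, which is why that proportion comes out slightly lower.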
    

    1.2 Summary statistics

    Use describe() to print summary information about this boolean Series

    movie_2_hours.describe()
    
    count      4916
    unique        2
    top       False
    freq       3877
    Name: duration, dtype: object
    

    Proportion of False and True values

    movie_2_hours.value_counts(normalize=True)
    
    False    0.788649
    True     0.211351
    Name: duration, dtype: float64
    

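    2 Boolean Indexing

    2.1 Boolean conditions

    In Python (and therefore in pandas), the bitwise operators (&, |, ~) bind more tightly than the comparison operators, so each comparison must be wrapped in parentheses before it is combined with another condition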
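
    A minimal sketch of what goes wrong without the parentheses, using a made-up Series named score (not part of the movie data). Without parentheses, Python groups the expression as score > (4 & score) < 8, a chained comparison that needs the truth value of a whole Series and therefore fails.

```python
import pandas as pd

# Hypothetical scores, used only for illustration
score = pd.Series([3, 6, 9])

# With parentheses each comparison is evaluated first, then combined
ok = (score > 4) & (score < 8)

# Without parentheses, & is applied before the comparisons
try:
    bad = score > 4 & score < 8
    raised = False
except ValueError:
    raised = True

print(ok.tolist(), raised)
```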

    2.1.1 Create several boolean conditions

    criteria1 = movie.imdb_score > 8
    criteria2 = movie.content_rating == 'PG-13'
    criteria3 = (movie.title_year < 2000) | (movie.title_year >= 2010)
    
    criteria3.head()
    
    movie_title
    Avatar                                        False
    Pirates of the Caribbean: At World's End      False
    Spectre                                        True
    The Dark Knight Rises                          True
    Star Wars: Episode VII - The Force Awakens    False
    Name: title_year, dtype: bool
    

    2.1.2 Combine the conditions into one

    criteria_final = criteria1 & criteria2 & criteria3
    
    criteria_final.head()
    
    movie_title
    Avatar                                        False
    Pirates of the Caribbean: At World's End      False
    Spectre                                       False
    The Dark Knight Rises                          True
    Star Wars: Episode VII - The Force Awakens    False
    dtype: bool
    

    2.2 Boolean filtering

    Build the first set of conditions

    crit_a1 = movie.imdb_score > 8
    crit_a2 = movie.content_rating == 'PG-13'
    crit_a3 = (movie.title_year < 2000) | (movie.title_year > 2009)
    final_crit_a = crit_a1 & crit_a2 & crit_a3
    

    Build the second set of conditions

    crit_b1 = movie.imdb_score < 5
    crit_b2 = movie.content_rating == 'R'
    crit_b3 = (movie.title_year >= 2000) & (movie.title_year <= 2010)
    final_crit_b = crit_b1 & crit_b2 & crit_b3
    

    Combine the two sets of conditions

    final_crit_all = final_crit_a | final_crit_b
    
    final_crit_all.head()
    
    movie_title
    Avatar                                        False
    Pirates of the Caribbean: At World's End      False
    Spectre                                       False
    The Dark Knight Rises                          True
    Star Wars: Episode VII - The Force Awakens    False
    dtype: bool
    

    Filter the data

    movie[final_crit_all].head()
    
                                color      director_name  ...  aspect_ratio  movie_facebook_likes
    movie_title
    The Dark Knight Rises       Color  Christopher Nolan  ...          2.35                164000
    The Avengers                Color        Joss Whedon  ...          1.85                123000
    Captain America: Civil War  Color      Anthony Russo  ...          2.35                 72000
    Guardians of the Galaxy     Color         James Gunn  ...          2.35                 96000
    Interstellar                Color  Christopher Nolan  ...          2.35                349000

    5 rows × 27 columns

    Verify the filter

    cols = ['imdb_score', 'content_rating', 'title_year']
    movie_filtered = movie.loc[final_crit_all, cols]
    movie_filtered.head(10)
    
                           imdb_score  content_rating  title_year
    movie_title
    The Dark Knight Rises         8.5           PG-13      2012.0
    The Avengers                  8.1           PG-13      2012.0
    ...                           ...             ...         ...
    Sex and the City 2            4.3               R      2010.0
    Rollerball                    3.0               R      2002.0

    10 rows × 3 columns

    2.3 Comparison with label-based indexing

    college = pd.read_csv('data/college.csv')
    college2 = college.set_index('STABBR')
    

    2.3.1 A single label

    In college2, STABBR is the row index, so select with loc

    college2.loc['TX'].head()
    
                                  INSTNM        CITY  ...  MD_EARN_WNE_P10  GRAD_DEBT_MDN_SUPP
    STABBR
    TX      Abilene Christian University     Abilene  ...            40200               25985
    TX           Alvin Community College       Alvin  ...            34500                6750
    TX                  Amarillo College    Amarillo  ...            31700               10950
    TX                  Angelina College      Lufkin  ...            26900   PrivacySuppressed
    TX           Angelo State University  San Angelo  ...            37700             21319.5

    5 rows × 26 columns

    In college, use boolean indexing to select all schools in Texas

    college[college['STABBR'] == 'TX'].head()
    
                                INSTNM        CITY  ...  MD_EARN_WNE_P10  GRAD_DEBT_MDN_SUPP
    3610  Abilene Christian University     Abilene  ...            40200               25985
    3611       Alvin Community College       Alvin  ...            34500                6750
    3612              Amarillo College    Amarillo  ...            31700               10950
    3613              Angelina College      Lufkin  ...            26900   PrivacySuppressed
    3614       Angelo State University  San Angelo  ...            37700             21319.5

    5 rows × 27 columns

    Compare the speed of the two approaches

    Method 1

    %timeit college[college['STABBR'] == 'TX']
    
    937 µs ± 58.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    

    Method 2

    %timeit college2.loc['TX']
    
    520 µs ± 21.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    
    %timeit college2 = college.set_index('STABBR')
    
    2.11 ms ± 185 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    

    2.3.2 Multiple labels

    Select several states with boolean indexing and with label selection

    states = ['TX', 'CA', 'NY']
    
    college[college['STABBR'].isin(states)]
    
                                                     INSTNM            CITY  ...  MD_EARN_WNE_P10  GRAD_DEBT_MDN_SUPP
    192                           Academy of Art University   San Francisco  ...            36000               35093
    193              ITT Technical Institute-Rancho Cordova  Rancho Cordova  ...            38800             25827.5
    ...                                                 ...             ...  ...              ...                 ...
    7533  Bay Area Medical Academy - San Jose Satellite ...        San Jose  ...              NaN   PrivacySuppressed
    7534            Excel Learning Center-San Antonio South     San Antonio  ...              NaN               12125

    1704 rows × 27 columns

    college2.loc[states].head()
    
                                  INSTNM        CITY  ...  MD_EARN_WNE_P10  GRAD_DEBT_MDN_SUPP
    STABBR
    TX      Abilene Christian University     Abilene  ...            40200               25985
    TX           Alvin Community College       Alvin  ...            34500                6750
    TX                  Amarillo College    Amarillo  ...            31700               10950
    TX                  Angelina College      Lufkin  ...            26900   PrivacySuppressed
    TX           Angelo State University  San Angelo  ...            37700             21319.5

    5 rows × 26 columns
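
    The two selections return the same rows but not necessarily in the same order. A minimal sketch on a made-up five-row table (college_demo and its values are hypothetical stand-ins for college.csv):

```python
import pandas as pd

# Hypothetical miniature of the college table
college_demo = pd.DataFrame({
    'INSTNM': ['A', 'B', 'C', 'D', 'E'],
    'STABBR': ['CA', 'TX', 'NY', 'TX', 'AL'],
})
states = ['TX', 'CA', 'NY']

# Boolean indexing keeps the rows in their original order
by_mask = college_demo[college_demo['STABBR'].isin(states)]

# Label selection returns the rows grouped in the order of the list
by_label = college_demo.set_index('STABBR').loc[states]

print(by_mask['INSTNM'].tolist())
print(by_label['INSTNM'].tolist())
```

    This is why the loc output above starts with the TX schools, while the isin output starts with whichever matching row comes first in the file.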

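    3 The query Method

    Use the query method to make boolean indexing more readable

    # Read the employee data and choose the departments and columns to select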
    
    employee = pd.read_csv('data/employee.csv')
    depts = ['Houston Police Department-HPD', 'Houston Fire Department (HFD)']
    select_columns = ['UNIQUE_ID', 'DEPARTMENT', 'GENDER', 'BASE_SALARY']
    
    # Build the query string and run the query method
    
    qs = "DEPARTMENT in @depts and GENDER == 'Female' and 80000 <= BASE_SALARY <= 120000"
    
    emp_filtered = employee.query(qs)
    emp_filtered[select_columns].head()
    
         UNIQUE_ID                     DEPARTMENT  GENDER  BASE_SALARY
    61          61  Houston Fire Department (HFD)  Female      96668.0
    136        136  Houston Police Department-HPD  Female      81239.0
    367        367  Houston Police Department-HPD  Female      86534.0
    474        474  Houston Police Department-HPD  Female      91181.0
    513        513  Houston Police Department-HPD  Female      81239.0
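
    For comparison, the sketch below writes the same kind of filter twice, once with query and once with explicit boolean conditions. The frame employee_demo, its department names, and its salaries are made up for illustration; only the column names follow the example above.

```python
import pandas as pd

# Hypothetical stand-in for employee.csv
employee_demo = pd.DataFrame({
    'DEPARTMENT': ['Police', 'Fire', 'Library', 'Police'],
    'GENDER': ['Female', 'Female', 'Female', 'Male'],
    'BASE_SALARY': [90000.0, 130000.0, 85000.0, 95000.0],
})
depts = ['Police', 'Fire']

# query: @depts refers to the Python variable defined above
via_query = employee_demo.query(
    "DEPARTMENT in @depts and GENDER == 'Female' "
    "and 80000 <= BASE_SALARY <= 120000"
)

# The same filter written as explicit boolean conditions
mask = (
    employee_demo['DEPARTMENT'].isin(depts)
    & (employee_demo['GENDER'] == 'Female')
    & employee_demo['BASE_SALARY'].between(80000, 120000)
)
via_mask = employee_demo[mask]

print(via_query.equals(via_mask))
```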

    4 Unique and Sorted Indexes

    4.1 Single-column index

    college = pd.read_csv('data/college.csv')
    college2 = college.set_index('STABBR')
    
    college2.index.is_monotonic_increasing
    
    False
    

    Sort college2, store the result as another object, and check whether its index is ordered

    college3 = college2.sort_index()
    college3.index.is_monotonic_increasing
    
    True
    

    Use INSTNM as the row index and check whether the index is unique

    college_unique = college.set_index('INSTNM')
    
    college_unique.index.is_unique
    
    True
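
    The index checks can be tried on a small made-up table. The sketch below uses is_monotonic_increasing, the explicit name that newer pandas versions require in place of is_monotonic; the frame demo and its values are hypothetical.

```python
import pandas as pd

# Hypothetical miniature of the college table
demo = pd.DataFrame({
    'INSTNM': ['Zeta College', 'Alpha College', 'Mid College'],
    'STABBR': ['TX', 'AL', 'TX'],
})

by_state = demo.set_index('STABBR')
unsorted_ok = by_state.index.is_monotonic_increasing             # labels out of order
sorted_ok = by_state.sort_index().index.is_monotonic_increasing  # labels in order
state_unique = by_state.index.is_unique                          # 'TX' repeats

by_name = demo.set_index('INSTNM')
name_unique = by_name.index.is_unique                            # names are distinct here

# On the unique index one label yields a Series;
# the repeated 'TX' label yields a DataFrame
one_row = by_name.loc['Mid College']
many_rows = by_state.loc['TX']

print(unsorted_ok, sorted_ok, state_unique, name_unique)
print(type(one_row).__name__, type(many_rows).__name__)
```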
    

    4.2 Concatenated index

    Combine the CITY and STABBR columns into a single row index, then sort it

    college.index = college['CITY'] + ', ' + college['STABBR']
    
    college = college.sort_index()
    
    college.head()
    
                                     INSTNM      CITY  ...  MD_EARN_WNE_P10  GRAD_DEBT_MDN_SUPP
    ARTESIA, CA           Angeles Institute   ARTESIA  ...              NaN               16850
    Aberdeen, SD       Presentation College  Aberdeen  ...            35900               25000
    Aberdeen, SD  Northern State University  Aberdeen  ...            33600               24847
    Aberdeen, WA       Grays Harbor College  Aberdeen  ...            27000               11490
    Abilene, TX   Hardin-Simmons University   Abilene  ...            38700               25864

    5 rows × 27 columns

    college.index.is_unique
    
    False
    

    Select all colleges in Miami, FL

    Method 1

    college.loc['Miami, FL'].head()
    
                                                  INSTNM   CITY  ...    MD_EARN_WNE_P10  GRAD_DEBT_MDN_SUPP
    Miami, FL        New Professions Technical Institute  Miami  ...              18700                8682
    Miami, FL               Management Resources College  Miami  ...  PrivacySuppressed               12182
    Miami, FL                   Strayer University-Doral  Miami  ...              49200             36173.5
    Miami, FL                   Keiser University- Miami  Miami  ...              29700               26063
    Miami, FL  George T Baker Aviation Technical College  Miami  ...              38600   PrivacySuppressed

    5 rows × 27 columns

    Method 2

    crit1 = college['CITY'] == 'Miami'
    crit2 = college['STABBR'] == 'FL'
    college[crit1 & crit2]
    
                                            INSTNM   CITY  ...    MD_EARN_WNE_P10  GRAD_DEBT_MDN_SUPP
    Miami, FL  New Professions Technical Institute  Miami  ...              18700                8682
    Miami, FL         Management Resources College  Miami  ...  PrivacySuppressed               12182
    ...                                        ...    ...  ...                ...                 ...
    Miami, FL           Advanced Technical Centers  Miami  ...  PrivacySuppressed   PrivacySuppressed
    Miami, FL    Lindsey Hopkins Technical College  Miami  ...              29800   PrivacySuppressed

    50 rows × 27 columns

    5 Booleans with loc/iloc

    movie = pd.read_csv('data/movie.csv', index_col='movie_title')
    

    5.1 Rows

    c1 = movie['content_rating'] == 'G'
    c2 = movie['imdb_score'] < 4
    criteria = c1 & c2
    
    bool_movie = movie[criteria]
    bool_movie
    
                                    color      director_name  ...  aspect_ratio  movie_facebook_likes
    movie_title
    The True Story of Puss'N Boots  Color   Jérôme Deschamps  ...           NaN                    90
    Doogal                          Color     Dave Borthwick  ...          1.85                   346
    ...                               ...                ...  ...           ...                   ...
    Justin Bieber: Never Say Never  Color         Jon M. Chu  ...          1.85                 62000
    Sunday School Musical           Color  Rachel Goldenberg  ...          1.85                   777

    6 rows × 27 columns

    Booleans with loc

    Method 1

    movie_loc = movie.loc[criteria]
    

    Check that the DataFrames produced by loc and by plain boolean indexing are identical

    movie_loc.equals(movie[criteria])
    
    True
    

    Method 2

    movie_loc2 = movie.loc[criteria.values]
    
    movie_loc2.equals(movie[criteria])
    
    True
    

    Booleans with iloc

    Because criteria is a Series that carries a row index, iloc cannot use it directly; pass the underlying ndarray instead

    movie_iloc = movie.iloc[criteria.values]
    
    movie_iloc.equals(movie_loc)
    
    True
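
    A minimal sketch of the difference, on a made-up three-row table (demo and its titles are hypothetical). Which exception iloc raises for a labelled boolean Series depends on the index type, so the sketch catches both candidates.

```python
import pandas as pd

# Hypothetical ratings table indexed by title
demo = pd.DataFrame(
    {'imdb_score': [3.5, 8.1, 2.9]},
    index=['Movie A', 'Movie B', 'Movie C'],
)
criteria = demo['imdb_score'] < 4

# loc aligns the boolean Series with the row labels
via_loc = demo.loc[criteria]

# iloc is purely positional and rejects a labelled boolean Series
try:
    demo.iloc[criteria]
    rejected = False
except (ValueError, NotImplementedError):
    rejected = True

# The underlying ndarray carries no labels, so iloc accepts it
via_iloc = demo.iloc[criteria.values]

print(rejected, via_iloc.equals(via_loc))
```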
    

    5.2 Columns

    Boolean indexing can also be used to select columns

    criteria_col = movie.dtypes == np.int64
    criteria_col.head()
    
    color                      False
    director_name              False
    num_critic_for_reviews     False
    duration                   False
    director_facebook_likes    False
    dtype: bool
    
    movie.loc[:, criteria_col].head()
    
                                                num_voted_users  cast_total_facebook_likes  movie_facebook_likes
    movie_title
    Avatar                                               886204                       4834                 33000
    Pirates of the Caribbean: At World's End             471220                      48350                     0
    Spectre                                              275868                      11700                 85000
    The Dark Knight Rises                               1144337                     106759                164000
    Star Wars: Episode VII - The Force Awakens                8                        143                     0

    movie.iloc[:, criteria_col.values].head()
    
                                                num_voted_users  cast_total_facebook_likes  movie_facebook_likes
    movie_title
    Avatar                                               886204                       4834                 33000
    Pirates of the Caribbean: At World's End             471220                      48350                     0
    Spectre                                              275868                      11700                 85000
    The Dark Knight Rises                               1144337                     106759                164000
    Star Wars: Episode VII - The Force Awakens                8                        143                     0
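
    As an aside that is not part of the original recipe, pandas also offers select_dtypes, which expresses the same column filter by dtype name. A minimal sketch on a made-up frame (demo and its columns are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one column of each kind
demo = pd.DataFrame({
    'title': ['A', 'B'],
    'votes': np.array([10, 20], dtype='int64'),
    'score': [7.5, 8.0],
})

# Boolean Series over the columns, as in the recipe above
criteria_col = demo.dtypes == np.int64
via_mask = demo.loc[:, criteria_col]

# select_dtypes picks columns by dtype without building the mask by hand
via_select = demo.select_dtypes(include=['int64'])

print(via_mask.columns.tolist(), via_select.equals(via_mask))
```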

    6 Using Boolean Values - where/mask

    mask() is the inverse boolean operation of where.

    DataFrame.where(cond, other=nan, inplace=False, **kwargs)
    Parameters:

    • cond : boolean NDFrame, array-like, or callable

      • Where cond is True, keep the original value. Where False, replace with corresponding value from other. If cond is callable, it is computed on the NDFrame and should return boolean NDFrame or array. The callable must not change input NDFrame (though pandas doesn’t check it).
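      • In other words, cond is a boolean DataFrame with the same shape as df. Where cond is True, the original value is kept; elsewhere the corresponding value from other is used.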
    • other : scalar, NDFrame, or callable

    • inplace : boolean, default False

      • Whether to perform the operation in place on the data
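
    A minimal sketch of how where and mask mirror each other, on a made-up Series (the values are illustrative only):

```python
import pandas as pd

s = pd.Series([1, 5, 10])   # made-up values
cond = s < 6

kept = s.where(cond)        # keeps values where cond is True, NaN elsewhere
hidden = s.mask(cond)       # replaces values where cond is True with NaN

print(kept.tolist())
print(hidden.tolist())
print(kept.equals(s.mask(~cond)))
```

    Negating the condition turns one method into the other, which is what "inverse boolean operation" means here.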

    6.1 where on a Series

    movie = pd.read_csv('data/movie.csv', index_col='movie_title')
    fb_likes = movie['actor_1_facebook_likes'].dropna()
    fb_likes.head()
    
    movie_title
    Avatar                                         1000.0
    Pirates of the Caribbean: At World's End      40000.0
    Spectre                                       11000.0
    The Dark Knight Rises                         27000.0
    Star Wars: Episode VII - The Force Awakens      131.0
    Name: actor_1_facebook_likes, dtype: float64
    

    Use describe to get a feel for the data

    fb_likes.describe(percentiles=[.1, .25, .5, .75, .9]).astype(int)
    
    count      4909
    mean       6494
              ...  
    90%       18000
    max      640000
    Name: actor_1_facebook_likes, Length: 10, dtype: int64
    

    Proportion of values with fewer than 20,000 likes

    criteria_high = fb_likes < 20000
    
    criteria_high.mean().round(2)
    
    0.91
    

    where returns a Series of the same size, but every value whose condition is False is replaced with a missing value

    fb_likes.where(criteria_high).head()
    
    movie_title
    Avatar                                         1000.0
    Pirates of the Caribbean: At World's End          NaN
    Spectre                                       11000.0
    The Dark Knight Rises                             NaN
    Star Wars: Episode VII - The Force Awakens      131.0
    Name: actor_1_facebook_likes, dtype: float64
    

    The second parameter, other, lets you control the replacement value

    fb_likes.where(criteria_high, other=20000).head()
    
    movie_title
    Avatar                                         1000.0
    Pirates of the Caribbean: At World's End      20000.0
    Spectre                                       11000.0
    The Dark Knight Rises                         20000.0
    Star Wars: Episode VII - The Force Awakens      131.0
    Name: actor_1_facebook_likes, dtype: float64
    

    Chain two where calls to set both an upper and a lower bound

    criteria_low = fb_likes > 300
    fb_likes_cap = fb_likes.where(criteria_high, other=20000).where(criteria_low, 300)
    fb_likes_cap.head()
    
    movie_title
    Avatar                                         1000.0
    Pirates of the Caribbean: At World's End      20000.0
    Spectre                                       11000.0
    The Dark Knight Rises                         20000.0
    Star Wars: Episode VII - The Force Awakens      300.0
    Name: actor_1_facebook_likes, dtype: float64
    

    The original Series and the modified Series have the same length

    len(fb_likes), len(fb_likes_cap)
    
    (4909, 4909)
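
    As an aside that goes beyond the original text, the same capping can be written with Series.clip. A minimal sketch on made-up values, assuming the bounds 300 and 20000 used above:

```python
import pandas as pd

likes = pd.Series([131.0, 1000.0, 40000.0])  # made-up values

# Two chained where calls: cap at 20000, then raise to at least 300
capped_where = (
    likes.where(likes < 20000, other=20000)
         .where(likes > 300, other=300)
)

# clip applies both bounds in one call
capped_clip = likes.clip(lower=300, upper=20000)

print(capped_clip.tolist(), capped_where.equals(capped_clip))
```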
    

    6.2 where on a DataFrame

    df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': ['a', 'b', 'f', 'n'],'ids2': ['a', 'n', 'c', 'n']})
    
    print(df)
    print(df < 2)
    df.where(df<2,1000)
    
       vals ids ids2
    0     1   a    a
    1     2   b    n
    2     3   f    c
    3     4   n    n
        vals   ids  ids2
    0   True  True  True
    1  False  True  True
    2  False  True  True
    3  False  True  True
    
       vals ids ids2
    0     1   a    a
    1  1000   b    n
    2  1000   f    c
    3  1000   n    n

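    The code below gives the same values as df.where(df < 2, 1000), except that the vals column becomes float because the intermediate NaN values force a float dtype.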

    print(df[df < 2])
    df[df < 2].fillna(1000)
    
       vals ids ids2
    0   1.0   a    a
    1   NaN   b    n
    2   NaN   f    c
    3   NaN   n    n
    
         vals ids ids2
    0     1.0   a    a
    1  1000.0   b    n
    2  1000.0   f    c
    3  1000.0   n    n
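
    The comparison between the two approaches can be reproduced on an all-numeric frame. Using only numeric columns is an assumption made here, because comparing string columns against an integer can raise a TypeError in recent pandas versions; the frame nums is hypothetical.

```python
import pandas as pd

# Hypothetical all-numeric frame
nums = pd.DataFrame({'x': [1, 2, 3], 'y': [0, 5, 1]})

# Replace in one step
replaced = nums.where(nums < 2, 1000)

# Mask first (introducing NaN), then fill the gaps
masked_then_filled = nums[nums < 2].fillna(1000)

# Same values; the second result is float because NaN forces a float dtype
same_values = bool((replaced == masked_then_filled).all().all())
print(same_values)
print(masked_then_filled.dtypes.tolist())
```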
  • Original article: https://www.cnblogs.com/shiyushiyu/p/9742808.html