zoukankan      html  css  js  c++  java
  • Pandas CookBook -- 04选取数据子集

    选取数据子集

    简书大神SeanCheney的译作,我作了些格式调整和文章目录结构的变化,更适合自己阅读,以后翻阅是更加方便自己查找吧

    import pandas as pd
    import numpy as np
    

    设定最大列数和最大行数

    pd.set_option('max_columns',5 , 'max_rows', 5)
    

    1 选取Series数据

    college = pd.read_csv('data/college.csv', index_col='INSTNM')
    
    city = college['CITY']
    

    1.1 iloc整数选取

    标量

    city.iloc[3]
    
    'Huntsville'
    

    多行 - 整数列表

    city.iloc[[10,20,30]]
    
    INSTNM
    Birmingham Southern College                            Birmingham
    George C Wallace State Community College-Hanceville    Hanceville
    Judson College                                             Marion
    Name: CITY, dtype: object
    

    多行 - 整数切片

    city.iloc[4:50:10]
    
    INSTNM
    Alabama State University              Montgomery
    Enterprise State Community College    Enterprise
    Heritage Christian University           Florence
    Marion Military Institute                 Marion
    Reid State Technical College           Evergreen
    Name: CITY, dtype: object
    

    1.2 loc标签选取

    标量

    city.loc['Heritage Christian University']
    
    'Florence'
    

    多行 - 标签列表

    随机选择4个标签

    labels = list(np.random.choice(city.index, 4))
    labels
    
    ['University of St Thomas',
     'Paul Mitchell the School-Woodbridge',
     'San Francisco Conservatory of Music',
     'Trinity Bible College']
    
    city.loc[labels]
    
    INSTNM
    University of St Thomas                   Saint Paul
    Paul Mitchell the School-Woodbridge       Woodbridge
    San Francisco Conservatory of Music    San Francisco
    Trinity Bible College                      Ellendale
    Name: CITY, dtype: object
    

    多行 - 标签切片

     city.loc['Alabama State University':'Reid State Technical College':10]
    
    INSTNM
    Alabama State University              Montgomery
    Enterprise State Community College    Enterprise
    Heritage Christian University           Florence
    Marion Military Institute                 Marion
    Reid State Technical College           Evergreen
    Name: CITY, dtype: object
    

    也可以切片逆序选取

    city.loc['Reid State Technical College':'Alabama State University':-10]
    
    INSTNM
    Reid State Technical College           Evergreen
    Marion Military Institute                 Marion
    Heritage Christian University           Florence
    Enterprise State Community College    Enterprise
    Alabama State University              Montgomery
    Name: CITY, dtype: object
    

    2 选取DataFrame数据

    2.1 选取DataFrame的行

    注意iloc是整数索引,loc是标签索引即可

    2.2 同时选取DataFrame的行和列

    2.2.1 选取前3行和前4列

    college.iloc[:3, :4]
    
    CITY STABBR HBCU MENONLY
    INSTNM
    Alabama A & M University Normal AL 1.0 0.0
    University of Alabama at Birmingham Birmingham AL 0.0 0.0
    Amridge University Montgomery AL 0.0 0.0
    college.loc[:'Amridge University', :'MENONLY']
    
    CITY STABBR HBCU MENONLY
    INSTNM
    Alabama A & M University Normal AL 1.0 0.0
    University of Alabama at Birmingham Birmingham AL 0.0 0.0
    Amridge University Montgomery AL 0.0 0.0

    2.2.2 选取两列的所有的行

    college.iloc[:, [4,6]].head()
    
    WOMENONLY SATVRMID
    INSTNM
    Alabama A & M University 0.0 424.0
    University of Alabama at Birmingham 0.0 570.0
    Amridge University 0.0 NaN
    University of Alabama in Huntsville 0.0 595.0
    Alabama State University 0.0 425.0
    college.loc[:, ['WOMENONLY', 'SATVRMID']]
    
    WOMENONLY SATVRMID
    INSTNM
    Alabama A & M University 0.0 424.0
    University of Alabama at Birmingham 0.0 570.0
    ... ... ...
    Bay Area Medical Academy - San Jose Satellite Location NaN NaN
    Excel Learning Center-San Antonio South NaN NaN

    7535 rows × 2 columns

    2.2.3 选取不连续的行和列

    college.iloc[[100, 200], [7, 15]]
    
    SATMTMID UGDS_NHPI
    INSTNM
    GateWay Community College NaN 0.0029
    American Baptist Seminary of the West NaN NaN
    rows = ['GateWay Community College', 'American Baptist Seminary of the West']
    columns = ['SATMTMID', 'UGDS_NHPI']
    college.loc[rows, columns]
    
    SATMTMID UGDS_NHPI
    INSTNM
    GateWay Community College NaN 0.0029
    American Baptist Seminary of the West NaN NaN

    2.2.4 选取一个标量值

    college.iloc[5, -4]
    
    0.401
    
    college.loc['The University of Alabama', 'PCTFLOAN']
    
    0.401
    

    2.2.5 行切片选取一列

    college.iloc[90:80:-2, 5]
    
    INSTNM
    Empire Beauty School-Flagstaff     0
    Charles of Italy Beauty College    0
    Central Arizona College            0
    University of Arizona              0
    Arizona State University-Tempe     0
    Name: RELAFFIL, dtype: int64
    
    start = 'Empire Beauty School-Flagstaff'
    stop = 'Arizona State University-Tempe'
    college.loc[start:stop:-2, 'RELAFFIL']
    
    INSTNM
    Empire Beauty School-Flagstaff     0
    Charles of Italy Beauty College    0
    Central Arizona College            0
    University of Arizona              0
    Arizona State University-Tempe     0
    Name: RELAFFIL, dtype: int64
    

    3 用整数和标签的互换

    用索引方法get_loc,找到指定列的整数位置

    col_start = college.columns.get_loc('UGDS_WHITE')
    col_end = college.columns.get_loc('UGDS_UNKN') + 1
    col_start, col_end
    
    (10, 19)
    
    college.iloc[:5, col_start:col_end]
    
    UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN
    INSTNM
    Alabama A & M University 0.0333 0.9353 ... 0.0059 0.0138
    University of Alabama at Birmingham 0.5922 0.2600 ... 0.0179 0.0100
    Amridge University 0.2990 0.4192 ... 0.0000 0.2715
    University of Alabama in Huntsville 0.6988 0.1255 ... 0.0332 0.0350
    Alabama State University 0.0158 0.9208 ... 0.0243 0.0137

    5 rows × 9 columns

    获得整数行对应的标签名

    row_end = college.index[4]
    col_start,col_end = college.columns[10],college.columns[19]
    col_start,col_end
    
    ('UGDS_WHITE', 'PPTUG_EF')
    
    college.loc[:row_end, col_start:col_end]
    
    UGDS_WHITE UGDS_BLACK ... UGDS_UNKN PPTUG_EF
    INSTNM
    Alabama A & M University 0.0333 0.9353 ... 0.0138 0.0656
    University of Alabama at Birmingham 0.5922 0.2600 ... 0.0100 0.2607
    Amridge University 0.2990 0.4192 ... 0.2715 0.4536
    University of Alabama in Huntsville 0.6988 0.1255 ... 0.0350 0.2146
    Alabama State University 0.0158 0.9208 ... 0.0137 0.0892

    5 rows × 10 columns

    4 切片操作

    4.1 惰性行切片

    惰性,我的理解就就是省去loc和iloc吧,但是惰性切片不能用于列,只能用于DataFrame的行和Series,也不能同时选取行和列。

    Series选取数据

    city = college['CITY']
    
    city[10:20:2]
    
    INSTNM
    Birmingham Southern College              Birmingham
    Concordia College Alabama                     Selma
    Enterprise State Community College       Enterprise
    Faulkner University                      Montgomery
    New Beginning College of Cosmetology    Albertville
    Name: CITY, dtype: object
    

    dataframe行选取

    college[10:20:2]
    
    CITY STABBR ... MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
    INSTNM
    Birmingham Southern College Birmingham AL ... 44200 27000
    Concordia College Alabama Selma AL ... 19900 PrivacySuppressed
    Enterprise State Community College Enterprise AL ... 24600 8273
    Faulkner University Montgomery AL ... 37200 22000
    New Beginning College of Cosmetology Albertville AL ... NaN 5500

    5 rows × 26 columns

    Series和DataFrame都可以用标签进行切片。

    start = 'Mesa Community College'
    stop = 'Spokane Community College'
    college[start:stop:1500]
    
    CITY STABBR ... MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
    INSTNM
    Mesa Community College Mesa AZ ... 35200 8000
    Hair Academy Inc-New Carrollton New Carrollton MD ... 15200 9666
    National College of Natural Medicine Portland OR ... NaN PrivacySuppressed

    3 rows × 26 columns

    4.2 按照字母切片

    需先对索引进行排序

    college = college.sort_index()
    college.head()
    
    CITY STABBR ... MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
    INSTNM
    A & W Healthcare Educators New Orleans LA ... NaN 19022.5
    A T Still University of Health Sciences Kirksville MO ... 219800 PrivacySuppressed
    ABC Beauty Academy Garland TX ... NaN PrivacySuppressed
    ABC Beauty College Inc Arkadelphia AR ... PrivacySuppressed 16500
    AI Miami International University of Art and Design Miami FL ... 29900 31000

    5 rows × 26 columns

    选取字母顺序在‘Sp’和‘Su’之间的学校

    college.loc['Sp':'Su']
    
    CITY STABBR ... MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
    INSTNM
    Spa Tech Institute-Ipswich Ipswich MA ... 21500 6333
    Spa Tech Institute-Plymouth Plymouth MA ... 21500 6333
    ... ... ... ... ... ...
    Styles and Profiles Beauty College Selmer TN ... PrivacySuppressed PrivacySuppressed
    Styletrends Barber and Hairstyling Academy Rock Hill SC ... PrivacySuppressed 9495.5

    201 rows × 26 columns

    天下风云出我辈,一入江湖岁月催
  • 相关阅读:
    CentOS 6.2安装Darwin Streaming Server
    流媒体技术笔记(协议相关)
    流媒体技术笔记(视频编码相关)
    CentOS6.2下编译mpeg4ip
    用popen函数操作其它程序的输入和输出
    给centos6.2安装yum源
    启动新进程(fork和exec系列函数实现)
    扩展Asterisk1.8.7的Dialplan Applications
    源码安装ffmpeg(带libx264)
    扩展Asterisk1.8.7的AMI接口
  • 原文地址:https://www.cnblogs.com/shiyushiyu/p/9742232.html
Copyright © 2011-2022 走看看