  Beginner-friendly Python data analysis and data mining, Chapter 2: pandas, Section 1: pandas basics Q&A 1-15

     

    Python pandas Q&A video series by Data School

    YouTube playlist and GitHub repository

    Table of contents

    1. What is pandas?
    2. How do I read a tabular data file into pandas?
    3. How do I select a pandas Series from a DataFrame?
    4. Why do some pandas commands end with parentheses (and others don't)?
    5. How do I rename columns in a pandas DataFrame?
    6. How do I remove columns from a pandas DataFrame?
    7. How do I sort a pandas DataFrame or a Series?
    8. How do I filter rows of a pandas DataFrame by column value?
    9. How do I apply multiple filter criteria to a pandas DataFrame?
    10. Your pandas questions answered!
    11. How do I use the "axis" parameter in pandas?
    12. How do I use string methods in pandas?
    13. How do I change the data type of a pandas Series?
    14. When should I use a "groupby" in pandas?
    15. How do I explore a pandas Series?
    16. How do I handle missing values in pandas?
    17. What do I need to know about the pandas index? (Part 1)
    18. What do I need to know about the pandas index? (Part 2)
    19. How do I select multiple rows and columns from a pandas DataFrame?
    20. When should I use the "inplace" parameter in pandas?
    21. How do I make my pandas DataFrame smaller and faster?
    22. How do I use pandas with scikit-learn to create Kaggle submissions?
    23. More of your pandas questions answered!
    24. How do I create dummy variables in pandas?
    25. How do I work with dates and times in pandas?
    26. How do I find and remove duplicate rows in pandas?
    27. How do I avoid a SettingWithCopyWarning in pandas?
    28. How do I change display options in pandas?
    29. How do I create a pandas DataFrame from another object?
    30. How do I apply a function to a pandas Series or DataFrame?
    In [1]:
     
     
     
     
     
    # the conventional way to import pandas
    import pandas as pd
     
     
     

    2. How do I read a tabular data file into pandas? (video)

    In [2]:
     
     
     
     
     
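    # read the Chipotle orders dataset directly from a URL and store the result in a DataFrame
    url1 = "https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/chipotle.tsv"  # dataset URL
    orders = pd.read_table(url1)  # read_table() assumes tab-separated values by default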
     
     
    In [3]:
     
     
     
     
     
    # examine the first 5 rows
    orders.head()
     
     
    Out[3]:
       order_id  quantity  item_name                              choice_description                                 item_price
    0  1         1         Chips and Fresh Tomato Salsa           NaN                                                $2.39
    1  1         1         Izze                                   [Clementine]                                       $3.39
    2  1         1         Nantucket Nectar                       [Apple]                                            $3.39
    3  1         1         Chips and Tomatillo-Green Chili Salsa  NaN                                                $2.39
    4  2         2         Chicken Bowl                           [Tomatillo-Red Chili Salsa (Hot), [Black Beans...  $16.98
     

    Documentation for read_table

    In [4]:
     
     
     
     
     
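    # read a dataset of movie reviewers (overriding the default parameter values of read_table)
    user_cols = ['user_id', 'age', 'gender', 'occupation', 'zipcode']  # column names
    url2 = "https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/u.user"  # dataset URL
    # alternative that reads from url2 and also skips the first 2 rows and the last 3 rows:
    # users = pd.read_table(url2, sep='|', header=None, names=user_cols, skiprows=2, skipfooter=3)
    # sep: field delimiter; header=None: the file has no header row; names: column names to assign
    users = pd.read_table('http://bit.ly/movieusers', sep='|', header=None, names=user_cols)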
     
     
    In [5]:
     
     
     
     
     
    # examine the first 5 rows
    users.head()
     
     
    Out[5]:
       user_id  age  gender  occupation  zipcode
    0  1        24   M       technician  85711
    1  2        53   F       other       94043
    2  3        23   M       writer      32067
    3  4        24   M       technician  43537
    4  5        33   F       other       15213
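
The effect of `sep`, `header` and `names` can be checked without a network connection. The following is a minimal sketch, assuming a small in-memory string (`raw`, a hypothetical stand-in built from the first three rows shown above) in place of the real file:

```python
import io
import pandas as pd

# Hypothetical stand-in for a pipe-delimited file that has no header row
raw = "1|24|M|technician|85711\n2|53|F|other|94043\n3|23|M|writer|32067\n"

user_cols = ['user_id', 'age', 'gender', 'occupation', 'zipcode']

# sep: the field delimiter; header=None: the first line is data, not column names;
# names: the column names to assign
users_demo = pd.read_table(io.StringIO(raw), sep='|', header=None, names=user_cols)

print(users_demo.shape)
print(list(users_demo.columns))
```

Without `header=None`, the first data row would be consumed as the header, so one reviewer would silently disappear from the DataFrame.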
     

    3. How do I select a pandas Series from a DataFrame? (video)

    In [6]:
     
     
     
     
     
    # read a dataset of UFO reports into a DataFrame
    url3 = "https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/ufo.csv"  # dataset URL
    ufo = pd.read_table(url3, sep=',')
     
     
    In [7]:
     
     
     
     
     
    # read_table with sep=',' can open a CSV file; read_csv is equivalent and uses a comma separator by default
    ufo = pd.read_csv(url3)
     
     
    In [8]:
     
     
     
     
     
    # examine the first 5 rows
    ufo.head()
     
     
    Out[8]:
       City                  Colors Reported  Shape Reported  State  Time
    0  Ithaca                NaN              TRIANGLE        NY     6/1/1930 22:00
    1  Willingboro           NaN              OTHER           NJ     6/30/1930 20:00
    2  Holyoke               NaN              OVAL            CO     2/15/1931 14:00
    3  Abilene               NaN              DISK            KS     6/1/1931 13:00
    4  New York Worlds Fair  NaN              LIGHT           NY     4/18/1933 19:00
    In [9]:
     
     
     
     
     
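    # select the 'City' Series using bracket notation
    ufo['City']
    # or select it using dot notation, which is quicker to type but does not work when the name contains spaces or clashes with a Python keyword or a DataFrame attribute
    ufo.City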
     
     
    Out[9]:
    0                      Ithaca
    1                 Willingboro
    2                     Holyoke
    3                     Abilene
    4        New York Worlds Fair
    5                 Valley City
    6                 Crater Lake
    7                        Alma
    8                     Eklutna
    9                     Hubbard
    10                    Fontana
    11                   Waterloo
    12                     Belton
    13                     Keokuk
    14                  Ludington
    15                Forest Home
    16                Los Angeles
    17                  Hapeville
    18                     Oneida
    19                 Bering Sea
    20                   Nebraska
    21                        NaN
    22                        NaN
    23                  Owensboro
    24                 Wilderness
    25                  San Diego
    26                 Wilderness
    27                     Clovis
    28                 Los Alamos
    29               Ft. Duschene
                     ...         
    18211                 Holyoke
    18212                  Carson
    18213                Pasadena
    18214                  Austin
    18215                El Campo
    18216            Garden Grove
    18217           Berthoud Pass
    18218              Sisterdale
    18219            Garden Grove
    18220             Shasta Lake
    18221                Franklin
    18222          Albrightsville
    18223              Greenville
    18224                 Eufaula
    18225             Simi Valley
    18226           San Francisco
    18227           San Francisco
    18228              Kingsville
    18229                 Chicago
    18230             Pismo Beach
    18231             Pismo Beach
    18232                    Lodi
    18233               Anchorage
    18234                Capitola
    18235          Fountain Hills
    18236              Grant Park
    18237             Spirit Lake
    18238             Eagle River
    18239             Eagle River
    18240                    Ybor
    Name: City, Length: 18241, dtype: object
     

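    Bracket notation always works, whereas dot notation has limitations:

    • Dot notation does not work if the Series name contains spaces
    • Dot notation does not work if the Series has the same name as a DataFrame method or attribute (such as 'head' or 'shape')
    • Dot notation cannot be used to define the name of a new Series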
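
These three limitations can be reproduced on a tiny DataFrame. The sketch below assumes made-up data; the frame `df` and its column named `shape` are hypothetical and chosen only to trigger the name clash:

```python
import warnings
import pandas as pd

df = pd.DataFrame({'City': ['Ithaca', 'Holyoke'],
                   'Colors Reported': ['RED', None],
                   'shape': ['TRIANGLE', 'OVAL']})

# Bracket notation works for any column name, including names with spaces
colors = df['Colors Reported']

# A column called 'shape' clashes with the DataFrame.shape attribute,
# so dot notation returns the (rows, columns) tuple instead of the column
attribute_value = df.shape
column_value = df['shape']

# Assigning with dot notation sets a plain Python attribute, not a column
# (pandas emits a UserWarning about this, silenced here)
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    df.Location = df.City + ', NY'
created_by_dot = 'Location' in df.columns

# Assigning with bracket notation creates the column
df['Location'] = df.City + ', NY'
created_by_bracket = 'Location' in df.columns

print(created_by_dot, created_by_bracket)
```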
    In [10]:
     
     
     
     
     
    # create a new 'Location' Series; the name being assigned must use bracket notation, not dot notation
    ufo['Location'] = ufo.City + ', ' + ufo.State
    ufo.head()
     
     
    Out[10]:
       City                  Colors Reported  Shape Reported  State  Time             Location
    0  Ithaca                NaN              TRIANGLE        NY     6/1/1930 22:00   Ithaca, NY
    1  Willingboro           NaN              OTHER           NJ     6/30/1930 20:00  Willingboro, NJ
    2  Holyoke               NaN              OVAL            CO     2/15/1931 14:00  Holyoke, CO
    3  Abilene               NaN              DISK            KS     6/1/1931 13:00   Abilene, KS
    4  New York Worlds Fair  NaN              LIGHT           NY     4/18/1933 19:00  New York Worlds Fair, NY
     

    4. Why do some pandas commands end with parentheses (and others don't)? (video)

    In [11]:
     
     
     
     
     
    # read a dataset of top-rated IMDb movies into a DataFrame
    url4="https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/imdb_1000.csv"
    movies = pd.read_csv(url4)
     
     
     

    Methods end with parentheses, while attributes do not:

    In [12]:
     
     
     
     
     
    # example method: show the first 5 rows
    movies.head()
     
     
    Out[12]:
       star_rating  title                     content_rating  genre   duration  actors_list
    0  9.3          The Shawshank Redemption  R               Crime   142       [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...
    1  9.2          The Godfather             R               Crime   175       [u'Marlon Brando', u'Al Pacino', u'James Caan']
    2  9.1          The Godfather: Part II    R               Crime   200       [u'Al Pacino', u'Robert De Niro', u'Robert Duv...
    3  9.0          The Dark Knight           PG-13           Action  152       [u'Christian Bale', u'Heath Ledger', u'Aaron E...
    4  8.9          Pulp Fiction              R               Crime   154       [u'John Travolta', u'Uma Thurman', u'Samuel L....
    In [13]:
     
     
     
     
     
    # example method: calculate summary statistics
    movies.describe()
     
     
    Out[13]:
           star_rating  duration
    count  979.000000   979.000000
    mean   7.889785     120.979571
    std    0.336069     26.218010
    min    7.400000     64.000000
    25%    7.600000     102.000000
    50%    7.800000     117.000000
    75%    8.100000     134.000000
    max    9.300000     242.000000
    In [14]:
     
     
     
     
     
    movies.describe(include=['object'])
     
     
    Out[14]:
            title    content_rating  genre  actors_list
    count   979      976             979    979
    unique  975      12              16     969
    top     Dracula  R               Drama  [u'Daniel Radcliffe', u'Emma Watson', u'Rupert...
    freq    2        460             278    6
    In [15]:
     
     
     
     
     
    # example attribute: number of rows and columns
    movies.shape
     
     
    Out[15]:
    (979, 6)
    In [16]:
     
     
     
     
     
    # example attribute: data type of each column
    movies.dtypes
     
     
    Out[16]:
    star_rating       float64
    title              object
    content_rating     object
    genre              object
    duration            int64
    actors_list        object
    dtype: object
    In [17]:
     
     
     
     
     
    # use an optional parameter of the describe method to summarize only the 'object' columns
    movies.describe(include=['object'])
     
     
    Out[17]:
            title    content_rating  genre  actors_list
    count   979      976             979    979
    unique  975      12              16     969
    top     Dracula  R               Drama  [u'Daniel Radcliffe', u'Emma Watson', u'Rupert...
    freq    2        460             278    6
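
The method/attribute distinction can also be checked programmatically. A minimal sketch, assuming a small made-up DataFrame (`df` and its values are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({'title': ['A', 'B', 'C'], 'duration': [90, 120, 150]})

# Attributes describe the object and are accessed without parentheses
n_rows, n_cols = df.shape
column_types = df.dtypes

# Methods perform an action, may take arguments, and are called with parentheses
first_two = df.head(2)
summary = df.describe()

# Leaving the parentheses off a method returns the method object itself
# rather than its result, which is a common source of confusing output
print(callable(df.head))
print(callable(df.shape))
```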
     

    Documentation for describe

    [Back to top]

     

    5. How do I rename columns in a pandas DataFrame? (video)

    In [18]:
     
     
     
     
     
    # examine the column names
    ufo.columns
     
     
    Out[18]:
    Index(['City', 'Colors Reported', 'Shape Reported', 'State', 'Time',
           'Location'],
          dtype='object')
    In [19]:
     
     
     
     
     
     
    # rename two of the columns by using the 'rename' method
    ufo.rename(columns={'Colors Reported':'Colors_Reported', 'Shape Reported':'Shape_Reported'}, inplace=True)
    ufo.columns
     
     
    Out[19]:
    Index(['City', 'Colors_Reported', 'Shape_Reported', 'State', 'Time',
           'Location'],
          dtype='object')
     

    Documentation for rename

    In [20]:
     
     
     
     
     
    # replace all of the column names by overwriting the 'columns' attribute
    ufo = pd.read_table(url3, sep=',')
    ufo_cols = ['city', 'colors reported', 'shape reported', 'state', 'time']
    ufo.columns = ufo_cols
    ufo.columns
     
     
    Out[20]:
    Index(['city', 'colors reported', 'shape reported', 'state', 'time'], dtype='object')
    In [21]:
     
     
     
     
     
    # replace the column names during the file reading process by using the 'names' parameter
    ufo = pd.read_csv(url3, header=0, names=ufo_cols)
    ufo.columns
     
     
    Out[21]:
    Index(['city', 'colors reported', 'shape reported', 'state', 'time'], dtype='object')
     

    Documentation for read_csv

    In [22]:
     
     
     
     
     
    # replace every space in the column names with an underscore in a single step
    ufo.columns = ufo.columns.str.replace(' ', '_')
    ufo.columns
     
     
    Out[22]:
    Index(['city', 'colors_reported', 'shape_reported', 'state', 'time'], dtype='object')
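
The renaming approaches in this section can be compared side by side. A minimal sketch, assuming a one-row made-up DataFrame; unlike the cells above it avoids `inplace=True`, so the original `df` is left unchanged:

```python
import pandas as pd

df = pd.DataFrame({'City': ['Ithaca'],
                   'Colors Reported': ['RED'],
                   'Shape Reported': ['OVAL']})

# 1. rename() with a mapping changes only the listed columns and returns a new DataFrame
renamed = df.rename(columns={'Colors Reported': 'Colors_Reported'})

# 2. Assigning to .columns replaces every name; the list length must match the column count
relabeled = df.copy()
relabeled.columns = ['city', 'colors reported', 'shape reported']

# 3. String methods on the column Index transform all names at once
cleaned = df.copy()
cleaned.columns = cleaned.columns.str.lower().str.replace(' ', '_')

print(list(renamed.columns))
print(list(relabeled.columns))
print(list(cleaned.columns))
```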
     

    Documentation for str.replace

    [Back to top]

     

    6. How do I remove columns from a pandas DataFrame? (video)

    In [35]:
     
     
     
     
     
    ufo = pd.read_table(url3, sep=',')
    ufo.head()
     
     
    Out[35]:
       City                  Colors Reported  Shape Reported  State  Time
    0  Ithaca                NaN              TRIANGLE        NY     6/1/1930 22:00
    1  Willingboro           NaN              OTHER           NJ     6/30/1930 20:00
    2  Holyoke               NaN              OVAL            CO     2/15/1931 14:00
    3  Abilene               NaN              DISK            KS     6/1/1931 13:00
    4  New York Worlds Fair  NaN              LIGHT           NY     4/18/1933 19:00
    In [37]:
     
     
     
     
     
    # remove a single column (axis=1 refers to columns; inplace=True modifies the original object instead of creating a new one)
    ufo.drop('Colors Reported', axis=1, inplace=True)
    ufo.head()
     
     
    Out[37]:
       City                  Shape Reported  State  Time
    0  Ithaca                TRIANGLE        NY     6/1/1930 22:00
    1  Willingboro           OTHER           NJ     6/30/1930 20:00
    2  Holyoke               OVAL            CO     2/15/1931 14:00
    3  Abilene               DISK            KS     6/1/1931 13:00
    4  New York Worlds Fair  LIGHT           NY     4/18/1933 19:00
     

    Documentation for drop

    In [38]:
     
     
     
     
     
    # remove multiple columns at once
    ufo.drop(['City', 'State'], axis=1, inplace=True)
    ufo.head()
     
     
    Out[38]:
       Shape Reported  Time
    0  TRIANGLE        6/1/1930 22:00
    1  OTHER           6/30/1930 20:00
    2  OVAL            2/15/1931 14:00
    3  DISK            6/1/1931 13:00
    4  LIGHT           4/18/1933 19:00
    In [39]:
     
     
     
     
     
    # remove multiple rows at once by label (axis=0 refers to rows; it is the default, but stating it explicitly is clearer)
    ufo.drop([0, 1], axis=0, inplace=True)
    ufo.head()
     
     
    Out[39]:
       Shape Reported  Time
    2  OVAL            2/15/1931 14:00
    3  DISK            6/1/1931 13:00
    4  LIGHT           4/18/1933 19:00
    5  DISK            9/15/1934 15:30
    6  CIRCLE          6/15/1935 0:00
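
As an alternative to the `axis` argument, `drop` also accepts `columns=` and `index=` keywords, which some readers find easier to read. A minimal sketch, assuming a three-row made-up DataFrame and no `inplace=True`, so each call returns a new object:

```python
import pandas as pd

df = pd.DataFrame({'City': ['Ithaca', 'Holyoke', 'Abilene'],
                   'State': ['NY', 'CO', 'KS'],
                   'Time': ['6/1/1930', '2/15/1931', '6/1/1931']})

# Same result as df.drop('State', axis=1)
no_state = df.drop(columns=['State'])

# Same result as df.drop([0, 1], axis=0); rows are identified by index label
no_first_rows = df.drop(index=[0, 1])

print(list(no_state.columns))
print(list(no_first_rows.index))
print(df.shape)  # the original DataFrame is unchanged
```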
     

    7. How do I sort a pandas DataFrame or a Series? (video)

    In [40]:
     
     
     
     
     
    movies.head()
     
     
    Out[40]:
       star_rating  title                     content_rating  genre   duration  actors_list
    0  9.3          The Shawshank Redemption  R               Crime   142       [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...
    1  9.2          The Godfather             R               Crime   175       [u'Marlon Brando', u'Al Pacino', u'James Caan']
    2  9.1          The Godfather: Part II    R               Crime   200       [u'Al Pacino', u'Robert De Niro', u'Robert Duv...
    3  9.0          The Dark Knight           PG-13           Action  152       [u'Christian Bale', u'Heath Ledger', u'Aaron E...
    4  8.9          Pulp Fiction              R               Crime   154       [u'John Travolta', u'Uma Thurman', u'Samuel L....
     

    Note: None of the sorting methods below affect the underlying data. (In other words, the sorting is temporary.)

    In [41]:
     
     
     
     
     
    # sort the 'title' Series in ascending order (returns a Series)
    movies.title.sort_values().head()
     
     
    Out[41]:
    542     (500) Days of Summer
    5               12 Angry Men
    201         12 Years a Slave
    698                127 Hours
    110    2001: A Space Odyssey
    Name: title, dtype: object
    In [42]:
     
     
     
     
     
    # sort the Series in descending order instead
    movies.title.sort_values(ascending=False).head()
     
     
    Out[42]:
    864               [Rec]
    526                Zulu
    615          Zombieland
    677              Zodiac
    955    Zero Dark Thirty
    Name: title, dtype: object
     

    Documentation for sort_values for a Series. (Prior to version 0.17, use order instead.)

    In [43]:
     
     
     
     
     
    # sort the entire DataFrame by the 'title' Series (returns a DataFrame)
    movies.sort_values('title').head()
     
     
    Out[43]:
         star_rating  title                  content_rating  genre      duration  actors_list
    542  7.8          (500) Days of Summer   PG-13           Comedy     95        [u'Zooey Deschanel', u'Joseph Gordon-Levitt', ...
    5    8.9          12 Angry Men           NOT RATED       Drama      96        [u'Henry Fonda', u'Lee J. Cobb', u'Martin Bals...
    201  8.1          12 Years a Slave       R               Biography  134       [u'Chiwetel Ejiofor', u'Michael Kenneth Willia...
    698  7.6          127 Hours              R               Adventure  94        [u'James Franco', u'Amber Tamblyn', u'Kate Mara']
    110  8.3          2001: A Space Odyssey  G               Mystery    160       [u'Keir Dullea', u'Gary Lockwood', u'William S...
    In [44]:
     
     
     
     
     
    # sort in descending order instead
    movies.sort_values('title', ascending=False).head()
     
     
    Out[44]:
         star_rating  title             content_rating  genre   duration  actors_list
    864  7.5          [Rec]             R               Horror  78        [u'Manuela Velasco', u'Ferran Terraza', u'Jorg...
    526  7.8          Zulu              UNRATED         Drama   138       [u'Stanley Baker', u'Jack Hawkins', u'Ulla Jac...
    615  7.7          Zombieland        R               Comedy  88        [u'Jesse Eisenberg', u'Emma Stone', u'Woody Ha...
    677  7.7          Zodiac            R               Crime   157       [u'Jake Gyllenhaal', u'Robert Downey Jr.', u'M...
    955  7.4          Zero Dark Thirty  R               Drama   157       [u'Jessica Chastain', u'Joel Edgerton', u'Chri...
     

    Documentation for sort_values for a DataFrame. (Prior to version 0.17, use sort instead.)

    In [45]:
     
     
     
     
     
    # sort the DataFrame first by 'content_rating', then by 'duration'
    movies.sort_values(['content_rating', 'duration']).head()
     
     
    Out[45]:
         star_rating  title                           content_rating  genre      duration  actors_list
    713  7.6          The Jungle Book                 APPROVED        Animation  78        [u'Phil Harris', u'Sebastian Cabot', u'Louis P...
    513  7.8          Invasion of the Body Snatchers  APPROVED        Horror     80        [u'Kevin McCarthy', u'Dana Wynter', u'Larry Ga...
    272  8.1          The Killing                     APPROVED        Crime      85        [u'Sterling Hayden', u'Coleen Gray', u'Vince E...
    703  7.6          Dracula                         APPROVED        Horror     85        [u'Bela Lugosi', u'Helen Chandler', u'David Ma...
    612  7.7          A Hard Day's Night              APPROVED        Comedy     87        [u'John Lennon', u'Paul McCartney', u'George H...
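
That sorting is temporary can be confirmed by inspecting the original object after the call. The sketch below assumes a four-row made-up DataFrame and also shows a per-column sort direction, passed as a list to `ascending`:

```python
import pandas as pd

df = pd.DataFrame({'title': ['Zulu', 'Alien', 'Brazil', 'Casablanca'],
                   'content_rating': ['R', 'R', 'PG', 'PG'],
                   'duration': [138, 117, 132, 102]})

# sort_values returns a new, sorted DataFrame; df itself keeps its original order
by_title = df.sort_values('title')

# Sort by rating ascending, then by duration descending within each rating
by_rating = df.sort_values(['content_rating', 'duration'], ascending=[True, False])

print(by_title.title.tolist())
print(by_rating.title.tolist())
print(df.title.tolist())  # unchanged
```

To keep a sorted result, assign it to a name as above; the index labels travel with their rows, which is why the sorted outputs in this section show a shuffled index.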
     

    8. How do I filter rows of a pandas DataFrame by column value? (video)

    In [46]:
     
     
     
     
     
    movies.head()
     
     
    Out[46]:
       star_rating  title                     content_rating  genre   duration  actors_list
    0  9.3          The Shawshank Redemption  R               Crime   142       [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...
    1  9.2          The Godfather             R               Crime   175       [u'Marlon Brando', u'Al Pacino', u'James Caan']
    2  9.1          The Godfather: Part II    R               Crime   200       [u'Al Pacino', u'Robert De Niro', u'Robert Duv...
    3  9.0          The Dark Knight           PG-13           Action  152       [u'Christian Bale', u'Heath Ledger', u'Aaron E...
    4  8.9          Pulp Fiction              R               Crime   154       [u'John Travolta', u'Uma Thurman', u'Samuel L....
    In [47]:
     
     
     
     
     
    # examine the number of rows and columns
    movies.shape
     
     
    Out[47]:
    (979, 6)
     

    Goal: Filter the DataFrame rows to only show movies with a 'duration' of at least 200 minutes.

    In [48]:
     
     
     
     
     
    # first, the long way: use a for loop to build a list with one boolean per DataFrame row
    # each element is True if the row satisfies the condition and False otherwise
    booleans = []
    for length in movies.duration:
        if length >= 200:
            booleans.append(True)
        else:
            booleans.append(False)
     
     
    In [49]:
     
     
     
     
     
    # confirm that the list has the same length as the DataFrame
    len(booleans)
     
     
    Out[49]:
    979
    In [50]:
     
     
     
     
     
    # examine the first five list elements
    booleans[0:5]
     
     
    Out[50]:
    [False, False, True, False, False]
    In [51]:
     
     
     
     
     
    # convert the list to a Series
    is_long = pd.Series(booleans)
    is_long.head()
     
     
    Out[51]:
    0    False
    1    False
    2     True
    3    False
    4    False
    dtype: bool
    In [52]:
     
     
     
     
     
    # use bracket notation with the boolean Series to tell the DataFrame which rows to display
    movies[is_long]
     
     
    Out[52]:
         star_rating  title                                          content_rating  genre      duration  actors_list
    2    9.1          The Godfather: Part II                         R               Crime      200       [u'Al Pacino', u'Robert De Niro', u'Robert Duv...
    7    8.9          The Lord of the Rings: The Return of the King  PG-13           Adventure  201       [u'Elijah Wood', u'Viggo Mortensen', u'Ian McK...
    17   8.7          Seven Samurai                                  UNRATED         Drama      207       [u'Toshirô Mifune', u'Takashi Shimura', u'K...
    78   8.4          Once Upon a Time in America                    R               Crime      229       [u'Robert De Niro', u'James Woods', u'Elizabet...
    85   8.4          Lawrence of Arabia                             PG              Adventure  216       [u"Peter O'Toole", u'Alec Guinness', u'Anthony...
    142  8.3          Lagaan: Once Upon a Time in India              PG              Adventure  224       [u'Aamir Khan', u'Gracy Singh', u'Rachel Shell...
    157  8.2          Gone with the Wind                             G               Drama      238       [u'Clark Gable', u'Vivien Leigh', u'Thomas Mit...
    204  8.1          Ben-Hur                                        G               Adventure  212       [u'Charlton Heston', u'Jack Hawkins', u'Stephe...
    445  7.9          The Ten Commandments                           APPROVED        Adventure  220       [u'Charlton Heston', u'Yul Brynner', u'Anne Ba...
    476  7.8          Hamlet                                         PG-13           Drama      242       [u'Kenneth Branagh', u'Julie Christie', u'Dere...
    630  7.7          Malcolm X                                      PG-13           Biography  202       [u'Denzel Washington', u'Angela Bassett', u'De...
    767  7.6          It's a Mad, Mad, Mad, Mad World                APPROVED        Action     205       [u'Spencer Tracy', u'Milton Berle', u'Ethel Me...
    In [53]:
     
     
     
     
     
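    # simplify the steps above: no need to write a for loop to create 'is_long'
    is_long = movies.duration >= 200
    movies[is_long]  # pandas filters the rows using this boolean Series
    # or equivalently, write it in one line (no need to create the 'is_long' object)
    movies[movies.duration >= 200]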
     
     
    Out[53]:
         star_rating  title                                          content_rating  genre      duration  actors_list
    2    9.1          The Godfather: Part II                         R               Crime      200       [u'Al Pacino', u'Robert De Niro', u'Robert Duv...
    7    8.9          The Lord of the Rings: The Return of the King  PG-13           Adventure  201       [u'Elijah Wood', u'Viggo Mortensen', u'Ian McK...
    17   8.7          Seven Samurai                                  UNRATED         Drama      207       [u'Toshirô Mifune', u'Takashi Shimura', u'K...
    78   8.4          Once Upon a Time in America                    R               Crime      229       [u'Robert De Niro', u'James Woods', u'Elizabet...
    85   8.4          Lawrence of Arabia                             PG              Adventure  216       [u"Peter O'Toole", u'Alec Guinness', u'Anthony...
    142  8.3          Lagaan: Once Upon a Time in India              PG              Adventure  224       [u'Aamir Khan', u'Gracy Singh', u'Rachel Shell...
    157  8.2          Gone with the Wind                             G               Drama      238       [u'Clark Gable', u'Vivien Leigh', u'Thomas Mit...
    204  8.1          Ben-Hur                                        G               Adventure  212       [u'Charlton Heston', u'Jack Hawkins', u'Stephe...
    445  7.9          The Ten Commandments                           APPROVED        Adventure  220       [u'Charlton Heston', u'Yul Brynner', u'Anne Ba...
    476  7.8          Hamlet                                         PG-13           Drama      242       [u'Kenneth Branagh', u'Julie Christie', u'Dere...
    630  7.7          Malcolm X                                      PG-13           Biography  202       [u'Denzel Washington', u'Angela Bassett', u'De...
    767  7.6          It's a Mad, Mad, Mad, Mad World                APPROVED        Action     205       [u'Spencer Tracy', u'Milton Berle', u'Ethel Me...
    In [54]:
     
     
     
     
     
    # select the 'genre' Series from the filtered DataFrame
    movies[movies.duration >= 200].genre
    # or equivalently, use the 'loc' indexer
    movies.loc[movies.duration >= 200, 'genre']
     
     
    Out[54]:
    2          Crime
    7      Adventure
    17         Drama
    78         Crime
    85     Adventure
    142    Adventure
    157        Drama
    204    Adventure
    445    Adventure
    476        Drama
    630    Biography
    767       Action
    Name: genre, dtype: object
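
The whole filtering pattern fits in a few lines on a toy dataset. A minimal sketch, assuming a four-row made-up DataFrame (titles and values are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({'title': ['A', 'B', 'C', 'D'],
                   'genre': ['Crime', 'Drama', 'Action', 'Drama'],
                   'duration': [200, 96, 152, 242]})

# Comparing a Series with a scalar produces one boolean per row
is_long = df.duration >= 200

# A boolean Series inside brackets keeps only the rows marked True
long_movies = df[is_long]

# loc takes the row filter and the column label in a single step
long_genres = df.loc[is_long, 'genre']

print(is_long.tolist())
print(long_movies.title.tolist())
print(long_genres.tolist())
```

`loc` selects rows and columns in one operation rather than chaining two selections, which also makes it the safer form when the filtered values are going to be assigned to.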
     

    Documentation for loc

    [Back to top]

     

    9. How do I apply multiple filter criteria to a pandas DataFrame? (video)

    In [55]:
     
     
     
     
     
    # read a dataset of top-rated IMDb movies into a DataFrame
    movies = pd.read_csv('http://bit.ly/imdbratings')
    movies.head()
     
     
    Out[55]:
       star_rating  title                     content_rating  genre   duration  actors_list
    0  9.3          The Shawshank Redemption  R               Crime   142       [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...
    1  9.2          The Godfather             R               Crime   175       [u'Marlon Brando', u'Al Pacino', u'James Caan']
    2  9.1          The Godfather: Part II    R               Crime   200       [u'Al Pacino', u'Robert De Niro', u'Robert Duv...
    3  9.0          The Dark Knight           PG-13           Action  152       [u'Christian Bale', u'Heath Ledger', u'Aaron E...
    4  8.9          Pulp Fiction              R               Crime   154       [u'John Travolta', u'Uma Thurman', u'Samuel L....
    In [56]:
     
     
     
     
     
    # filter the DataFrame to only show movies with a 'duration' of at least 200 minutes
    movies[movies.duration >= 200]
     
     
    Out[56]:
         star_rating  title                                          content_rating  genre      duration  actors_list
    2    9.1          The Godfather: Part II                         R               Crime      200       [u'Al Pacino', u'Robert De Niro', u'Robert Duv...
    7    8.9          The Lord of the Rings: The Return of the King  PG-13           Adventure  201       [u'Elijah Wood', u'Viggo Mortensen', u'Ian McK...
    17   8.7          Seven Samurai                                  UNRATED         Drama      207       [u'Toshirô Mifune', u'Takashi Shimura', u'K...
    78   8.4          Once Upon a Time in America                    R               Crime      229       [u'Robert De Niro', u'James Woods', u'Elizabet...
    85   8.4          Lawrence of Arabia                             PG              Adventure  216       [u"Peter O'Toole", u'Alec Guinness', u'Anthony...
    142  8.3          Lagaan: Once Upon a Time in India              PG              Adventure  224       [u'Aamir Khan', u'Gracy Singh', u'Rachel Shell...
    157  8.2          Gone with the Wind                             G               Drama      238       [u'Clark Gable', u'Vivien Leigh', u'Thomas Mit...
    204  8.1          Ben-Hur                                        G               Adventure  212       [u'Charlton Heston', u'Jack Hawkins', u'Stephe...
    445  7.9          The Ten Commandments                           APPROVED        Adventure  220       [u'Charlton Heston', u'Yul Brynner', u'Anne Ba...
    476  7.8          Hamlet                                         PG-13           Drama      242       [u'Kenneth Branagh', u'Julie Christie', u'Dere...
    630  7.7          Malcolm X                                      PG-13           Biography  202       [u'Denzel Washington', u'Angela Bassett', u'De...
    767  7.6          It's a Mad, Mad, Mad, Mad World                APPROVED        Action     205       [u'Spencer Tracy', u'Milton Berle', u'Ethel Me...
     

    Understanding logical operators:

    • and: True only if both sides of the operator are True
    • or: True if either side of the operator is True
    In [57]:
     
     
     
     
     
    print(True and True)
    print(True and False)
    print(False and False)
     
     
     
    True
    False
    False
    
    In [58]:
     
     
     
     
     
    print(True or True)
    print(True or False)
    print(False or False)
     
     
     
    True
    True
    False
    
     

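    Rules for specifying multiple filter criteria in pandas:

    • use & instead of and
    • use | instead of or
    • add parentheses around each condition to specify evaluation order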
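
These rules exist because Python's `and` and `or` need a single True/False value, while a pandas condition is a whole Series of them. A minimal sketch, assuming a four-row made-up DataFrame, that shows `&`, `|` and `~` working element by element and `and` raising an error:

```python
import pandas as pd

df = pd.DataFrame({'title': ['A', 'B', 'C', 'D'],
                   'genre': ['Crime', 'Drama', 'Action', 'Drama'],
                   'duration': [200, 96, 152, 242]})

# Each condition sits in its own parentheses, because & and | bind more
# tightly than comparison operators such as >= and ==
both = df[(df.duration >= 200) & (df.genre == 'Drama')]
either = df[(df.duration >= 200) | (df.genre == 'Drama')]

# ~ negates a boolean Series element by element
not_drama = df[~(df.genre == 'Drama')]

# The Python keyword 'and' asks for the truth value of an entire Series,
# which pandas refuses to guess
try:
    df[(df.duration >= 200) and (df.genre == 'Drama')]
    raised = False
except ValueError:
    raised = True

print(both.title.tolist(), either.title.tolist(), not_drama.title.tolist(), raised)
```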

     

    Goal: Further filter the DataFrame of long movies (duration >= 200) to only show movies which also have a 'genre' of 'Drama'

    In [59]:
     
     
     
     
     
    # use the '&' operator to specify that both conditions are required
    movies[(movies.duration >=200) & (movies.genre == 'Drama')]
     
     
    Out[59]:
         star_rating  title               content_rating  genre  duration  actors_list
    17   8.7          Seven Samurai       UNRATED         Drama  207       [u'Toshirô Mifune', u'Takashi Shimura', u'K...
    157  8.2          Gone with the Wind  G               Drama  238       [u'Clark Gable', u'Vivien Leigh', u'Thomas Mit...
    476  7.8          Hamlet              PG-13           Drama  242       [u'Kenneth Branagh', u'Julie Christie', u'Dere...
    In [60]:
     
     
     
     
     
    # incorrect: using the '|' operator would show movies that are either long or dramas (or both)
    movies[(movies.duration >=200) | (movies.genre == 'Drama')].head()
     
     
    Out[60]:
        star_rating  title                                          content_rating  genre      duration  actors_list
    2   9.1          The Godfather: Part II                         R               Crime      200       [u'Al Pacino', u'Robert De Niro', u'Robert Duv...
    5   8.9          12 Angry Men                                   NOT RATED       Drama      96        [u'Henry Fonda', u'Lee J. Cobb', u'Martin Bals...
    7   8.9          The Lord of the Rings: The Return of the King  PG-13           Adventure  201       [u'Elijah Wood', u'Viggo Mortensen', u'Ian McK...
    9   8.9          Fight Club                                     R               Drama      139       [u'Brad Pitt', u'Edward Norton', u'Helena Bonh...
    13  8.8          Forrest Gump                                   PG-13           Drama      142       [u'Tom Hanks', u'Robin Wright', u'Gary Sinise']
     

    Goal: Filter the original DataFrame to show movies with a 'genre' of 'Crime' or 'Drama' or 'Action'

    In [61]:
     
     
     
     
     
     
    # use the '|' operator to specify that a row can match any of the three criteria
    movies[(movies.genre == 'Crime') | (movies.genre == 'Drama') | (movies.genre == 'Action')].head(10)
    # or equivalently, use the 'isin' method
    movies[movies.genre.isin(['Crime', 'Drama', 'Action'])].head(10)
     
     
    Out[61]:
        star_rating  title                                           content_rating  genre   duration  actors_list
    0   9.3          The Shawshank Redemption                        R               Crime   142       [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...
    1   9.2          The Godfather                                   R               Crime   175       [u'Marlon Brando', u'Al Pacino', u'James Caan']
    2   9.1          The Godfather: Part II                          R               Crime   200       [u'Al Pacino', u'Robert De Niro', u'Robert Duv...
    3   9.0          The Dark Knight                                 PG-13           Action  152       [u'Christian Bale', u'Heath Ledger', u'Aaron E...
    4   8.9          Pulp Fiction                                    R               Crime   154       [u'John Travolta', u'Uma Thurman', u'Samuel L....
    5   8.9          12 Angry Men                                    NOT RATED       Drama   96        [u'Henry Fonda', u'Lee J. Cobb', u'Martin Bals...
    9   8.9          Fight Club                                      R               Drama   139       [u'Brad Pitt', u'Edward Norton', u'Helena Bonh...
    11  8.8          Inception                                       PG-13           Action  148       [u'Leonardo DiCaprio', u'Joseph Gordon-Levitt'...
    12  8.8          Star Wars: Episode V - The Empire Strikes Back  PG              Action  124       [u'Mark Hamill', u'Harrison Ford', u'Carrie Fi...
    13  8.8          Forrest Gump                                    PG-13           Drama   142       [u'Tom Hanks', u'Robin Wright', u'Gary Sinise']
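
The equivalence between chained `|` conditions and `isin` can be verified directly. A minimal sketch, assuming a four-row made-up DataFrame; the `wanted` list is a hypothetical name:

```python
import pandas as pd

df = pd.DataFrame({'title': ['A', 'B', 'C', 'D'],
                   'genre': ['Crime', 'Drama', 'Action', 'Comedy']})

wanted = ['Crime', 'Drama']

# One == comparison per value, combined with |
with_or = df[(df.genre == 'Crime') | (df.genre == 'Drama')]

# isin tests membership in the whole list at once
with_isin = df[df.genre.isin(wanted)]

# Prefixing the condition with ~ selects the rows whose genre is NOT in the list
excluded = df[~df.genre.isin(wanted)]

print(with_isin.equals(with_or))
print(excluded.title.tolist())
```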
     

    Documentation for isin

    [Back to top]

     

    10. Your pandas questions answered! (video)

     

    Question: When reading from a file, how do I read in only a subset of the columns?

    In [62]:
     
     
     
     
     
    url3 = "https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/ufo.csv"  # dataset URL
    ufo = pd.read_csv(url3)  # read the CSV file with read_csv
    ufo.columns
     
     
    Out[62]:
    Index(['City', 'Colors Reported', 'Shape Reported', 'State', 'Time'], dtype='object')
    In [63]:
     
     
     
     
     
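    # specify which columns to include by name
    ufo = pd.read_csv(url3, usecols=['City', 'State'])
    # or specify columns by position (positions 0 and 4 are 'City' and 'Time', so this selects a different pair than the call above)
    ufo = pd.read_csv(url3, usecols=[0, 4])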
    ufo.columns
     
     
    Out[63]:
    Index(['City', 'Time'], dtype='object')
     

    Question: When reading from a file, how do I read in only a subset of the rows?

    In [64]:
     
     
     
     
     
    # specify how many rows to read
    ufo = pd.read_csv(url3, nrows=3)
    ufo
     
     
    Out[64]:
       City         Colors Reported  Shape Reported  State  Time
    0  Ithaca       NaN              TRIANGLE        NY     6/1/1930 22:00
    1  Willingboro  NaN              OTHER           NJ     6/30/1930 20:00
    2  Holyoke      NaN              OVAL            CO     2/15/1931 14:00
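
`usecols` and `nrows` can be tried offline on an in-memory copy of the three rows shown above. A minimal sketch; the string `raw` is an assumed stand-in for the real file:

```python
import io
import pandas as pd

# In-memory stand-in for ufo.csv, built from the three rows shown above
raw = ("City,Colors Reported,Shape Reported,State,Time\n"
       "Ithaca,,TRIANGLE,NY,6/1/1930 22:00\n"
       "Willingboro,,OTHER,NJ,6/30/1930 20:00\n"
       "Holyoke,,OVAL,CO,2/15/1931 14:00\n")

# Select columns by name
by_name = pd.read_csv(io.StringIO(raw), usecols=['City', 'State'])

# Select the same two columns by position ('State' is at position 3)
by_position = pd.read_csv(io.StringIO(raw), usecols=[0, 3])

# Read only the first two data rows
first_two = pd.read_csv(io.StringIO(raw), nrows=2)

print(list(by_name.columns))
print(by_name.equals(by_position))
print(len(first_two))
```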
     

    Documentation for read_csv

     

    Question: How do I iterate through a Series?

    In [65]:
     
     
     
     
     
    # a Series is directly iterable (like a list)
    for c in ufo.City:
        print(c)
     
     
     
    Ithaca
    Willingboro
    Holyoke
    
     

    Question: How do I iterate through a DataFrame?

    In [66]:
     
     
     
     
     
    # various methods are available to iterate through a DataFrame
    for index, row in ufo.iterrows():
        print(index, row.City, row.State)
     
     
     
    0 Ithaca NY
    1 Willingboro NJ
    2 Holyoke CO
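
`iterrows` is only one of the available methods. The sketch below assumes a two-row made-up DataFrame and compares it with `itertuples`, then shows a column-wise expression that needs no loop at all:

```python
import pandas as pd

df = pd.DataFrame({'City': ['Ithaca', 'Willingboro'], 'State': ['NY', 'NJ']})

# iterrows yields (index label, row as a Series) pairs
rows_a = [(index, row.City, row.State) for index, row in df.iterrows()]

# itertuples yields one namedtuple per row; the index label is its 'Index' field
rows_b = [(row.Index, row.City, row.State) for row in df.itertuples()]

# Many row-by-row loops can be replaced by operating on whole columns
location = df.City + ', ' + df.State

print(rows_a)
print(rows_b)
print(location.tolist())
```

Whole-column operations are generally preferable when they can express the task, since they avoid building a Python object for every row.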
    
     

    Documentation for iterrows

     

    Question: How do I drop all non-numeric columns from a DataFrame?

    In [67]:
     
     
     
     
     
    # read a dataset of alcohol consumption into a DataFrame, and check the data types
    url7= 'https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/drinks.csv'
    drinks = pd.read_csv(url7)
    drinks.dtypes
     
     
    Out[67]:
    country                          object
    beer_servings                     int64
    spirit_servings                   int64
    wine_servings                     int64
    total_litres_of_pure_alcohol    float64
    continent                        object
    dtype: object
    In [68]:
     
     
     
     
     
    # only include the numeric columns in the DataFrame
    import numpy as np
    drinks.select_dtypes(include=[np.number]).dtypes
     
     
    Out[68]:
    beer_servings                     int64
    spirit_servings                   int64
    wine_servings                     int64
    total_litres_of_pure_alcohol    float64
    dtype: object
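
`select_dtypes` also takes an `exclude` argument, and the string `'number'` can be used in place of `np.number`. A minimal sketch, assuming a two-row made-up DataFrame with shortened column names:

```python
import pandas as pd

df = pd.DataFrame({'country': ['Albania', 'Andorra'],
                   'beer_servings': [89, 245],
                   'total_litres': [4.9, 12.4],
                   'continent': ['Europe', 'Europe']})

# Keep only the numeric columns (integers and floats)
numeric = df.select_dtypes(include='number')

# Or keep everything except the numeric columns
non_numeric = df.select_dtypes(exclude='number')

print(list(numeric.columns))
print(list(non_numeric.columns))
```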
     

    Documentation for select_dtypes

     

    Question: How do I know whether I should pass an argument as a string or a list?

    In [69]:
     
     
     
     
     
    # describe all of the numeric columns
    drinks.describe()
     
     
    Out[69]:
           beer_servings  spirit_servings  wine_servings  total_litres_of_pure_alcohol
    count  193.000000     193.000000       193.000000     193.000000
    mean   106.160622     80.994819        49.450777      4.717098
    std    101.143103     88.284312        79.697598      3.773298
    min    0.000000       0.000000         0.000000       0.000000
    25%    20.000000      4.000000         1.000000       1.300000
    50%    76.000000      56.000000        8.000000       4.200000
    75%    188.000000     128.000000       59.000000      7.200000
    max    376.000000     438.000000       370.000000     14.400000
    In [70]:
     
     
     
     
     
    # pass the string 'all' to describe all columns
    drinks.describe(include='all')
     
     
    Out[70]:
            country           beer_servings  spirit_servings  wine_servings  total_litres_of_pure_alcohol  continent
    count   193               193.000000     193.000000       193.000000     193.000000                    193
    unique  193               NaN            NaN              NaN            NaN                           6
    top     Marshall Islands  NaN            NaN              NaN            NaN                           Africa
    freq    1                 NaN            NaN              NaN            NaN                           53
    mean    NaN               106.160622     80.994819        49.450777      4.717098                      NaN
    std     NaN               101.143103     88.284312        79.697598      3.773298                      NaN
    min     NaN               0.000000       0.000000         0.000000       0.000000                      NaN
    25%     NaN               20.000000      4.000000         1.000000       1.300000                      NaN
    50%     NaN               76.000000      56.000000        8.000000       4.200000                      NaN
    75%     NaN               188.000000     128.000000       59.000000      7.200000                      NaN
    max     NaN               376.000000     438.000000       370.000000     14.400000                     NaN
    In [71]:
     
     
     
     
     
    # pass a list of data types to only describe certain types
    drinks.describe(include=['object', 'float64'])
     
     
    Out[71]:
            country           total_litres_of_pure_alcohol  continent
    count   193               193.000000                    193
    unique  193               NaN                           6
    top     Marshall Islands  NaN                           Africa
    freq    1                 NaN                           53
    mean    NaN               4.717098                      NaN
    std     NaN               3.773298                      NaN
    min     NaN               0.000000                      NaN
    25%     NaN               1.300000                      NaN
    50%     NaN               4.200000                      NaN
    75%     NaN               7.200000                      NaN
    max     NaN               14.400000                     NaN
    In [72]:
     
     
     
     
     
    # pass a list even if you only want to describe a single data type
    drinks.describe(include=['object'])
     
     
    Out[72]:
            country           continent
    count   193               193
    unique  193               6
    top     Marshall Islands  Africa
    freq    1                 53
     

    Documentation for describe

    [Back to top]

     

    11. How do I use the "axis" parameter in pandas? (video)

    In [73]:
     
     
     
     
     
    url7 = 'https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/drinks.csv'
    drinks = pd.read_csv(url7)
    drinks.head()
     
     
    Out[73]:
           country  beer_servings  spirit_servings  wine_servings  total_litres_of_pure_alcohol  continent
    0  Afghanistan              0                0              0                           0.0       Asia
    1      Albania             89              132             54                           4.9     Europe
    2      Algeria             25                0             14                           0.7     Africa
    3      Andorra            245              138            312                          12.4     Europe
    4       Angola            217               57             45                           5.9     Africa
    In [74]:
     
     
     
     
     
    # drop a column (temporarily)
    drinks.drop('continent', axis=1).head()
     
     
    Out[74]:
           country  beer_servings  spirit_servings  wine_servings  total_litres_of_pure_alcohol
    0  Afghanistan              0                0              0                           0.0
    1      Albania             89              132             54                           4.9
    2      Algeria             25                0             14                           0.7
    3      Andorra            245              138            312                          12.4
    4       Angola            217               57             45                           5.9
     

    Documentation for drop

    In [75]:
     
     
     
     
     
    # drop a row (temporarily)
    drinks.drop(2, axis=0).head()
     
     
    Out[75]:
                 country  beer_servings  spirit_servings  wine_servings  total_litres_of_pure_alcohol      continent
    0        Afghanistan              0                0              0                           0.0           Asia
    1            Albania             89              132             54                           4.9         Europe
    3            Andorra            245              138            312                          12.4         Europe
    4             Angola            217               57             45                           5.9         Africa
    5  Antigua & Barbuda            102              128             45                           4.9  North America
     

    When using the axis parameter to refer to rows or columns:

    • axis 0 refers to rows
    • axis 1 refers to columns
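
The row/column convention can be tried on a small made-up DataFrame. This is only a sketch: `toy` and its values are hypothetical. It also shows the `index=` and `columns=` keywords of `drop`, which express the same intent without an `axis` argument.

```python
import pandas as pd

# Hypothetical toy data with the default integer index 0, 1, 2
toy = pd.DataFrame({
    'country': ['A', 'B', 'C'],
    'beer': [0, 89, 25],
    'continent': ['Asia', 'Europe', 'Africa'],
})

# axis=1 (or axis='columns'): the label is looked up among the column names
without_continent = toy.drop('continent', axis=1)

# axis=0 (or axis='index'): the label is looked up in the row index
without_row_2 = toy.drop(2, axis=0)

# Equivalent spellings that avoid the axis argument
same_columns = toy.drop(columns='continent')
same_rows = toy.drop(index=2)

# drop returns a new DataFrame, so the original keeps its shape
print(toy.shape, without_continent.shape, without_row_2.shape)
```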

    In [76]:
     
     
     
     
     
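    # calculate the mean of each numeric column
    # (numeric_only=True skips the text columns; pandas 2.0+ raises an error without it)
    drinks.mean(numeric_only=True)
    # or equivalently, specify the axis explicitly
    drinks.mean(axis=0, numeric_only=True)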
     
     
    Out[76]:
    beer_servings                   106.160622
    spirit_servings                  80.994819
    wine_servings                    49.450777
    total_litres_of_pure_alcohol      4.717098
    dtype: float64
     

    Documentation for mean

    In [77]:
     
     
     
     
     
    # calculate the mean of each row (numeric columns only)
    drinks.mean(axis=1, numeric_only=True).head()
     
     
    Out[77]:
    0      0.000
    1     69.975
    2      9.925
    3    176.850
    4     81.225
    dtype: float64
     

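    When using the axis parameter to perform a mathematical operation:

    • axis 0 means the operation should "move down" the row axis
    • axis 1 means the operation should "move across" the column axis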
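
A quick way to check the two directions is a tiny all-numeric DataFrame, where the results are easy to verify by hand. The data below is made up for illustration.

```python
import pandas as pd

# Hypothetical all-numeric toy data
toy = pd.DataFrame({'beer': [0, 90, 30], 'wine': [10, 50, 20]})

# axis=0 moves down the rows, leaving one value per column
column_means = toy.mean(axis=0)

# axis=1 moves across the columns, leaving one value per row
row_means = toy.mean(axis=1)

# The string aliases give the same results
assert toy.mean(axis='index').equals(column_means)
assert toy.mean(axis='columns').equals(row_means)

print(column_means)
print(row_means)
```

The result of `axis=0` is indexed by column name, while the result of `axis=1` is indexed by the original row labels.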
    In [78]:
     
     
     
     
     
    # 'index' is an alias for axis 0
    drinks.mean(axis='index', numeric_only=True)
     
     
    Out[78]:
    beer_servings                   106.160622
    spirit_servings                  80.994819
    wine_servings                    49.450777
    total_litres_of_pure_alcohol      4.717098
    dtype: float64
    In [79]:
     
     
     
     
     
    # 'columns' is an alias for axis 1
    drinks.mean(axis='columns', numeric_only=True).head()
     
     
    Out[79]:
    0      0.000
    1     69.975
    2      9.925
    3    176.850
    4     81.225
    dtype: float64
     

    12. How do I use string methods in pandas? (video)

    In [80]:
     
     
     
     
     
    # define the URL of the dataset
    url1 = "https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/chipotle.tsv"
    # read the tab-separated file with read_table()
    orders = pd.read_table(url1)
    orders.head()
     
     
    Out[80]:
       order_id  quantity                              item_name                                 choice_description  item_price
    0         1         1           Chips and Fresh Tomato Salsa                                                NaN       $2.39
    1         1         1                                   Izze                                       [Clementine]       $3.39
    2         1         1                       Nantucket Nectar                                            [Apple]       $3.39
    3         1         1  Chips and Tomatillo-Green Chili Salsa                                                NaN       $2.39
    4         2         2                           Chicken Bowl  [Tomatillo-Red Chili Salsa (Hot), [Black Beans...      $16.98
    In [81]:
     
     
     
     
     
    # the usual way to call a string method in Python
    'hello'.upper()
     
     
    Out[81]:
    'HELLO'
    In [82]:
     
     
     
     
     
    # string methods for a pandas Series are accessed via 'str'
    orders.item_name.str.upper().head()
     
     
    Out[82]:
    0             CHIPS AND FRESH TOMATO SALSA
    1                                     IZZE
    2                         NANTUCKET NECTAR
    3    CHIPS AND TOMATILLO-GREEN CHILI SALSA
    4                             CHICKEN BOWL
    Name: item_name, dtype: object
    In [83]:
     
     
     
     
     
    # the string method 'contains' checks for a substring and returns a boolean Series
    orders.item_name.str.contains('Chicken').head()
     
     
    Out[83]:
    0    False
    1    False
    2    False
    3    False
    4     True
    Name: item_name, dtype: bool
    In [84]:
     
     
     
     
     
    # use the boolean Series to filter the DataFrame
    orders[orders.item_name.str.contains('Chicken')].head()
     
     
    Out[84]:
        order_id  quantity             item_name                                 choice_description  item_price
    4          2         2          Chicken Bowl  [Tomatillo-Red Chili Salsa (Hot), [Black Beans...      $16.98
    5          3         1          Chicken Bowl  [Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou...      $10.98
    11         6         1  Chicken Crispy Tacos  [Roasted Chili Corn Salsa, [Fajita Vegetables,...       $8.75
    12         6         1    Chicken Soft Tacos  [Roasted Chili Corn Salsa, [Rice, Black Beans,...       $8.75
    13         7         1          Chicken Bowl  [Fresh Tomato Salsa, [Fajita Vegetables, Rice,...      $11.25
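
Filtering on a column that has missing values, such as `choice_description`, needs one extra step, because a mask containing missing entries cannot be used for boolean indexing. The sketch below uses made-up rows and the `na` and `case` parameters of `str.contains`; how missing values appear in the unfilled mask can depend on the pandas version and dtype.

```python
import numpy as np
import pandas as pd

# Hypothetical rows imitating the orders data, including a missing description
toy = pd.DataFrame({
    'item_name': ['Chicken Bowl', 'Izze', 'Chips'],
    'choice_description': ['[Rice, Beans]', '[Clementine]', np.nan],
})

# na=False counts a missing description as "no match",
# so the mask is purely boolean and safe to filter with
mask = toy.choice_description.str.contains('Rice', na=False)
rice_orders = toy[mask]

# case=False makes the match case-insensitive
chicken = toy[toy.item_name.str.contains('chicken', case=False)]

print(rice_orders)
print(chicken)
```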
    In [85]:
     
     
     
     
     
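    # string methods can be chained together
    # (regex=False treats '[' and ']' as literal characters)
    orders.choice_description.str.replace('[', '', regex=False).str.replace(']', '', regex=False).head()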
     
     
    Out[85]:
    0                                                  NaN
    1                                           Clementine
    2                                                Apple
    3                                                  NaN
    4    Tomatillo-Red Chili Salsa (Hot), Black Beans, ...
    Name: choice_description, dtype: object
    In [86]:
     
     
     
     
     
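    # many pandas string methods support regular expressions
    # (the character class [\[\]] matches either bracket)
    orders.choice_description.str.replace(r'[\[\]]', '', regex=True).head()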
     
     
    Out[86]:
    0                                                  NaN
    1                                           Clementine
    2                                                Apple
    3                                                  NaN
    4    Tomatillo-Red Chili Salsa (Hot), Black Beans, ...
    Name: choice_description, dtype: object
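
The literal and regular-expression approaches can be compared side by side on a small made-up Series. This sketch passes the `regex` argument explicitly, which avoids relying on its default (the default changed between pandas versions).

```python
import pandas as pd

# Hypothetical descriptions with nested brackets
s = pd.Series(['[Clementine]', '[Rice, [Beans, Salsa]]'])

# Literal replacement: each bracket needs its own call
literal = s.str.replace('[', '', regex=False).str.replace(']', '', regex=False)

# Regex replacement: one character class removes both brackets at once
pattern = r'[\[\]]'
with_regex = s.str.replace(pattern, '', regex=True)

print(with_regex.tolist())
```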
     

    String handling section of the pandas API reference

    [Back to top]

     

    13. How do I change the data type of a pandas Series? (video)

    In [87]:
     
     
     
     
     
    url7 = 'https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/drinks.csv'
    drinks = pd.read_csv(url7)
    drinks.head()
     
     
    Out[87]:
           country  beer_servings  spirit_servings  wine_servings  total_litres_of_pure_alcohol  continent
    0  Afghanistan              0                0              0                           0.0       Asia
    1      Albania             89              132             54                           4.9     Europe
    2      Algeria             25                0             14                           0.7     Africa
    3      Andorra            245              138            312                          12.4     Europe
    4       Angola            217               57             45                           5.9     Africa
    In [88]:
     
     
     
     
     
    # examine the data type of each Series
    drinks.dtypes
     
     
    Out[88]:
    country                          object
    beer_servings                     int64
    spirit_servings                   int64
    wine_servings                     int64
    total_litres_of_pure_alcohol    float64
    continent                        object
    dtype: object
    In [89]:
     
     
     
     
     
    # change the data type of an existing Series
    drinks['beer_servings'] = drinks.beer_servings.astype(float)
    drinks.dtypes
     
     
    Out[89]:
    country                          object
    beer_servings                   float64
    spirit_servings                   int64
    wine_servings                     int64
    total_litres_of_pure_alcohol    float64
    continent                        object
    dtype: object
     

    Documentation for astype

    In [90]:
     
     
     
     
     
    # alternatively, change the data type of a Series while reading in the file
    drinks = pd.read_csv(url7, dtype={'beer_servings':float})
    drinks.dtypes
     
     
    Out[90]:
    country                          object
    beer_servings                   float64
    spirit_servings                   int64
    wine_servings                     int64
    total_litres_of_pure_alcohol    float64
    continent                        object
    dtype: object
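
Both conversion routes can be tried offline by parsing a small in-memory CSV. The CSV text below is made up and only imitates the layout of drinks.csv. The sketch also passes a `{column: type}` mapping to `astype`, which converts selected columns of a DataFrame in one call.

```python
import io
import pandas as pd

# Hypothetical inline CSV standing in for drinks.csv
csv_text = "country,beer_servings,wine_servings\nA,0,0\nB,89,54\n"

# Without dtype, whole-number columns are inferred as integers
inferred = pd.read_csv(io.StringIO(csv_text))

# dtype takes a {column: type} mapping that is applied while parsing
converted = pd.read_csv(io.StringIO(csv_text), dtype={'beer_servings': float})

# astype accepts the same kind of mapping after loading
converted_later = inferred.astype({'beer_servings': float})

print(inferred.dtypes)
print(converted.dtypes)
```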
    In [91]:
     
     
     
     
     
    orders = pd.read_table(url1)
    orders.head()
     
     
    Out[91]:
       order_id  quantity                              item_name                                 choice_description  item_price
    0         1         1           Chips and Fresh Tomato Salsa                                                NaN       $2.39
    1         1         1                                   Izze                                       [Clementine]       $3.39
    2         1         1                       Nantucket Nectar                                            [Apple]       $3.39
    3         1         1  Chips and Tomatillo-Green Chili Salsa                                                NaN       $2.39
    4         2         2                           Chicken Bowl  [Tomatillo-Red Chili Salsa (Hot), [Black Beans...      $16.98
    In [92]:
     
     
     
     
     
    # examine the data type of each Series
    orders.dtypes
     
     
    Out[92]:
    order_id               int64
    quantity               int64
    item_name             object
    choice_description    object
    item_price            object
    dtype: object
    In [93]:
     
     
     
     
     
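    # convert a string to a number in order to do math
    # (regex=False treats '$' as a literal character rather than a regex anchor)
    orders.item_price.str.replace('$', '', regex=False).astype(float).mean()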
     
     
    Out[93]:
    7.464335785374397
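
As a small sketch of the same conversion on made-up prices: `astype(float)` raises an error if any entry cannot be parsed, so the example also shows `pd.to_numeric` with `errors='coerce'`, which turns unparseable entries into NaN instead. The `'n/a'` entry is a hypothetical bad value, not something taken from the orders data.

```python
import pandas as pd

# Hypothetical price strings in the same format as item_price
prices = pd.Series(['$2.39', '$3.39', '$16.98'])

# Remove the literal dollar sign, then convert the remaining text to float
as_float = prices.str.replace('$', '', regex=False).astype(float)

# With messy input, to_numeric(errors='coerce') yields NaN instead of raising
messy = pd.Series(['$2.39', 'n/a'])
coerced = pd.to_numeric(messy.str.replace('$', '', regex=False), errors='coerce')

print(as_float.mean())
print(coerced)
```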
    In [94]:
     
     
     
     
     
    # the string method 'contains' checks for a substring and returns a boolean Series
    orders.item_name.str.contains('Chicken').head()
     
     
    Out[94]:
    0    False
    1    False
    2    False
    3    False
    4     True
    Name: item_name, dtype: bool
    In [95]:
     
     
     
     
     
    # convert a boolean Series to an integer (False = 0, True = 1)
    orders.item_name.str.contains('Chicken').astype(int).head()
     
     
    Out[95]:
    0    0
    1    0
    2    0
    3    0
    4    1
    Name: item_name, dtype: int32
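
Because True behaves as 1 and False as 0, a boolean Series can also be summed or averaged directly, without an explicit conversion. A minimal sketch on made-up item names:

```python
import pandas as pd

# Hypothetical item names
items = pd.Series(['Chicken Bowl', 'Izze', 'Chicken Soft Tacos', 'Chips'])
is_chicken = items.str.contains('Chicken')

# sum() counts the matches and mean() gives the fraction that match
n_chicken = is_chicken.sum()
share_chicken = is_chicken.mean()

# astype(int) makes the 0/1 encoding explicit, for example to store it as a column
flags = is_chicken.astype(int)

print(n_chicken, share_chicken, flags.tolist())
```

The exact integer width of `flags` (int32 or int64) can depend on the platform and pandas version.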
     

    14. When should I use a "groupby" in pandas? (video)

    In [96]:
     
     
     
     
     
    drinks = pd.read_csv(url7)
    drinks.head()
     
     
    Out[96]:
           country  beer_servings  spirit_servings  wine_servings  total_litres_of_pure_alcohol  continent
    0  Afghanistan              0                0              0                           0.0       Asia
    1      Albania             89              132             54                           4.9     Europe
    2      Algeria             25                0             14                           0.7     Africa
    3      Andorra            245              138            312                          12.4     Europe
    4       Angola            217               57             45                           5.9     Africa
    In [97]:
     
     
     
     
     
    # calculate the mean beer servings across the entire dataset
    drinks.beer_servings.mean()
     
     
    Out[97]:
    106.16062176165804
    In [98]:
     
     
     
     
     
    # calculate the mean beer servings just for countries in Africa
    drinks[drinks.continent=='Africa'].beer_servings.mean()
     
     
    Out[98]:
    61.471698113207545
    In [99]:
     
     
     
     
     
    # calculate the mean beer servings for each continent
    drinks.groupby('continent').beer_servings.mean()
     
     
    Out[99]:
    continent
    Africa            61.471698
    Asia              37.045455
    Europe           193.777778
    North America    145.434783
    Oceania           89.687500
    South America    175.083333
    Name: beer_servings, dtype: float64
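
One way to see what `groupby` does is to compare it with the filter-then-aggregate approach repeated for every group. The sketch below does that on a made-up five-row DataFrame; the values are illustrative only.

```python
import pandas as pd

# Hypothetical toy data with three groups
toy = pd.DataFrame({
    'continent': ['Africa', 'Africa', 'Europe', 'Europe', 'Asia'],
    'beer_servings': [25, 217, 89, 245, 0],
})

# Manual approach: filter one group at a time and aggregate it
manual = {
    name: toy[toy.continent == name].beer_servings.mean()
    for name in toy.continent.unique()
}

# groupby splits the rows by group, aggregates each group, and combines the results
grouped = toy.groupby('continent').beer_servings.mean()

print(manual)
print(grouped)
```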
     

    Documentation for groupby

    In [100]:
     
     
     
     
     
    # other aggregation functions (such as 'max') can also be used with groupby
    drinks.groupby('continent').beer_servings.max()
     
     
    Out[100]:
    continent
    Africa           376
    Asia             247
    Europe           361
    North America    285
    Oceania          306
    South America    333
    Name: beer_servings, dtype: int64
    In [101]:
     
     
     
     
     
    # multiple aggregation functions can be applied simultaneously
    drinks.groupby('continent').beer_servings.agg(['count', 'mean', 'min', 'max'])
     
     
    Out[101]:
                   count        mean  min  max
    continent
    Africa            53   61.471698    0  376
    Asia              44   37.045455    0  247
    Europe            45  193.777778    0  361
    North America     23  145.434783    1  285
    Oceania           16   89.687500    0  306
    South America     12  175.083333   93  333
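
When the default column names (`count`, `mean`, ...) are not descriptive enough, `agg` also accepts keyword arguments of the form `new_name=(column, function)`, known as named aggregation (available in pandas 0.25 and later). A sketch on made-up data, with hypothetical output names `countries` and `avg_beer`:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'continent': ['Africa', 'Africa', 'Europe', 'Europe', 'Asia'],
    'beer_servings': [25, 217, 89, 245, 0],
})

# A list of function names gives one output column per function
summary = toy.groupby('continent').beer_servings.agg(['count', 'mean', 'min', 'max'])

# Named aggregation: choose the output column names yourself
named = toy.groupby('continent').agg(
    countries=('beer_servings', 'count'),
    avg_beer=('beer_servings', 'mean'),
)

print(summary)
print(named)
```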
     

    Documentation for agg

    In [102]:
     
     
     
     
     
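    # specifying a column is optional: without one, every numeric column is aggregated
    # (numeric_only=True skips the 'country' text column; pandas 2.0+ raises an error without it)
    drinks.groupby('continent').mean(numeric_only=True)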
     
     
    Out[102]:
                   beer_servings  spirit_servings  wine_servings  total_litres_of_pure_alcohol
    continent
    Africa             61.471698        16.339623      16.264151                      3.007547
    Asia               37.045455        60.840909       9.068182                      2.170455
    Europe            193.777778       132.555556     142.222222                      8.617778
    North America     145.434783       165.739130      24.521739                      5.995652
    Oceania            89.687500        58.437500      35.625000                      3.381250
    South America     175.083333       114.750000      62.416667                      6.308333
    In [103]:
     
     
     
     
     
    # allow plots to appear in the Jupyter notebook
    %matplotlib inline
     
     
    In [104]:
     
     
     
     
     
    # side-by-side bar plot of the DataFrame directly above
    drinks.groupby('continent').mean(numeric_only=True).plot(kind='bar')
     
     
    Out[104]:
    <matplotlib.axes._subplots.AxesSubplot at 0x296edf23b70>
     
     

    Documentation for plot

    [Back to top]

     

    15. How do I explore a pandas Series? (video)

    In [105]:
     
     
     
     
     
    # read a dataset of top-rated IMDb movies into a DataFrame
    url4 = "https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/imdb_1000.csv"
    movies = pd.read_csv(url4)
    movies.head()
     
     
    Out[105]:
       star_rating                     title  content_rating   genre  duration                                        actors_list
    0          9.3  The Shawshank Redemption               R   Crime       142  [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...
    1          9.2             The Godfather               R   Crime       175    [u'Marlon Brando', u'Al Pacino', u'James Caan']
    2          9.1    The Godfather: Part II               R   Crime       200  [u'Al Pacino', u'Robert De Niro', u'Robert Duv...
    3          9.0           The Dark Knight           PG-13  Action       152  [u'Christian Bale', u'Heath Ledger', u'Aaron E...
    4          8.9              Pulp Fiction               R   Crime       154  [u'John Travolta', u'Uma Thurman', u'Samuel L....
    In [106]:
     
     
     
     
     
    # examine the data types
    movies.dtypes
     
     
    Out[106]:
    star_rating       float64
    title              object
    content_rating     object
    genre              object
    duration            int64
    actors_list        object
    dtype: object
     

    Exploring a non-numeric Series:

    In [107]:
     
     
     
     
     
    # count the non-null values, the unique values, and the frequency of the most common value
    movies.genre.describe()
     
     
    Out[107]:
    count       979
    unique       16
    top       Drama
    freq        278
    Name: genre, dtype: object
     

    Documentation for describe

    In [108]:
     
     
     
     
     
    # count how many times each value in the Series occurs
    movies.genre.value_counts()
     
     
    Out[108]:
    Drama        278
    Comedy       156
    Action       136
    Crime        124
    Biography     77
    Adventure     75
    Animation     62
    Horror        29
    Mystery       16
    Western        9
    Sci-Fi         5
    Thriller       5
    Film-Noir      3
    Family         2
    Fantasy        1
    History        1
    Name: genre, dtype: int64
     

    Documentation for value_counts

    In [109]:
     
     
     
     
     
    # display proportions instead of raw counts
    movies.genre.value_counts(normalize=True)
     
     
    Out[109]:
    Drama        0.283963
    Comedy       0.159346
    Action       0.138917
    Crime        0.126660
    Biography    0.078652
    Adventure    0.076609
    Animation    0.063330
    Horror       0.029622
    Mystery      0.016343
    Western      0.009193
    Sci-Fi       0.005107
    Thriller     0.005107
    Film-Noir    0.003064
    Family       0.002043
    Fantasy      0.001021
    History      0.001021
    Name: genre, dtype: float64
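
The relationship between raw counts and normalized counts is easy to check on a short made-up Series. The sketch also includes a missing value to show the `dropna` parameter: by default, missing values are left out of both the counts and the denominator.

```python
import pandas as pd

# Hypothetical genres, including one missing value
genres = pd.Series(['Drama', 'Comedy', 'Drama', 'Crime', 'Drama', None])

# Raw counts, most frequent first (missing values are excluded by default)
counts = genres.value_counts()

# normalize=True divides each count by the number of non-missing values
shares = genres.value_counts(normalize=True)

# dropna=False also counts the missing values
with_missing = genres.value_counts(dropna=False)

print(counts)
print(shares)
print(with_missing)
```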
    In [110]:
     
     
     
     
     
    # 'value_counts' outputs a Series
    type(movies.genre.value_counts())
     
     
    Out[110]:
    pandas.core.series.Series
    In [111]:
     
     
     
     
     
    # so Series methods can be used on the result
    movies.genre.value_counts().head()
     
     
    Out[111]:
    Drama        278
    Comedy       156
    Action       136
    Crime        124
    Biography     77
    Name: genre, dtype: int64
    In [112]:
     
     
     
     
     
    # display the unique values in the Series
    movies.genre.unique()
     
     
    Out[112]:
    array(['Crime', 'Action', 'Drama', 'Western', 'Adventure', 'Biography',
           'Comedy', 'Animation', 'Mystery', 'Horror', 'Film-Noir', 'Sci-Fi',
           'History', 'Thriller', 'Family', 'Fantasy'], dtype=object)
    In [113]:
     
     
     
     
     
    # count the number of unique values in the Series
    movies.genre.nunique()
     
     
    Out[113]:
    16
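
`unique` and `nunique` treat missing values differently, which matters when a column has gaps. A small sketch on a made-up Series with one missing entry:

```python
import pandas as pd

# Hypothetical genres with one missing value
genres = pd.Series(['Drama', 'Comedy', 'Drama', None])

# unique() keeps the missing value as one of the distinct entries
distinct = genres.unique()

# nunique() ignores missing values unless dropna=False is passed
n_distinct = genres.nunique()
n_with_missing = genres.nunique(dropna=False)

print(len(distinct), n_distinct, n_with_missing)
```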
     

    Documentation for unique and nunique

    In [114]:
     
     
     
     
     
    # compute a cross-tabulation of two Series
    pd.crosstab(movies.genre, movies.content_rating)
     
     
    Out[114]:
    content_rating  APPROVED   G  GP  NC-17  NOT RATED  PASSED  PG  PG-13    R  TV-MA  UNRATED  X
    genre
    Action                 3   1   1      0          4       1  11     44   67      0        3  0
    Adventure              3   2   0      0          5       1  21     23   17      0        2  0
    Animation              3  20   0      0          3       0  25      5    5      0        1  0
    Biography              1   2   1      0          1       0   6     29   36      0        0  0
    Comedy                 9   2   1      1         16       3  23     23   73      0        4  1
    Crime                  6   0   0      1          7       1   6      4   87      0       11  1
    Drama                 12   3   0      4         24       1  25     55  143      1        9  1
    Family                 0   1   0      0          0       0   1      0    0      0        0  0
    Fantasy                0   0   0      0          0       0   0      0    1      0        0  0
    Film-Noir              1   0   0      0          1       0   0      0    0      0        1  0
    History                0   0   0      0          0       0   0      0    0      0        1  0
    Horror                 2   0   0      1          1       0   1      2   16      0        5  1
    Mystery                4   1   0      0          1       0   1      2    6      0        1  0
    Sci-Fi                 1   0   0      0          0       0   0      1    3      0        0  0
    Thriller               1   0   0      0          0       0   1      0    3      0        0  0
    Western                1   0   0      0          2       0   2      1    3      0        0  0
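
A cross-tabulation is a table of counts for every pair of values, which can be confirmed on a tiny made-up DataFrame. The sketch also uses `margins=True`, which adds row and column totals labelled `All`, and rebuilds the same counts with `groupby`, `size`, and `unstack` for comparison.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'genre': ['Crime', 'Crime', 'Action', 'Drama', 'Action'],
    'content_rating': ['R', 'R', 'PG-13', 'R', 'R'],
})

# Count how often each (genre, content_rating) pair occurs
table = pd.crosstab(toy.genre, toy.content_rating)

# margins=True appends totals in a row and a column named 'All'
with_totals = pd.crosstab(toy.genre, toy.content_rating, margins=True)

# The same counts built from groupby: count each pair, then pivot the ratings into columns
via_groupby = toy.groupby(['genre', 'content_rating']).size().unstack(fill_value=0)

print(table)
print(with_totals)
```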
     

    Documentation for crosstab

     

    Exploring a numeric Series:

    In [115]:
     
     
     
     
     
    # calculate various summary statistics
    movies.duration.describe()
     
     
    Out[115]:
    count    979.000000
    mean     120.979571
    std       26.218010
    min       64.000000
    25%      102.000000
    50%      117.000000
    75%      134.000000
    max      242.000000
    Name: duration, dtype: float64
    In [116]:
     
     
     
     
     
    # many statistics are implemented as Series methods
    movies.duration.mean()
     
     
    Out[116]:
    120.97957099080695
     

    Documentation for mean

    In [117]:
     
     
     
     
     
    # 'value_counts' is primarily useful for categorical data, not numerical data
    movies.duration.value_counts().head()
     
     
    Out[117]:
    112    23
    113    22
    102    20
    101    20
    129    19
    Name: duration, dtype: int64
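
For numeric data, `value_counts` becomes more informative with its `bins` parameter, which groups the values into equal-width intervals before counting. A sketch on made-up durations:

```python
import pandas as pd

# Hypothetical movie durations in minutes
durations = pd.Series([64, 95, 102, 112, 112, 134, 180, 242])

# bins=3 splits the range 64..242 into three equal-width intervals;
# sort_index() lists the intervals from shortest to longest
binned = durations.value_counts(bins=3).sort_index()

print(binned)
```

This is essentially a text version of a histogram: each row is one interval and its count.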
    In [118]:
     
     
     
     
     
    # allow plots to appear in the Jupyter notebook
    %matplotlib inline
     
     
    In [119]:
     
     
     
     
     
    # histogram of the 'duration' Series (shows the distribution of a numerical variable)
    movies.duration.plot(kind='hist')
     
     
    Out[119]:
    <matplotlib.axes._subplots.AxesSubplot at 0x296ee26ba58>
     
    In [120]:
     
     
     
     
     
    # bar plot of the 'value_counts' for the 'genre' Series
    movies.genre.value_counts().plot(kind='bar')
     
     
    Out[120]:
    <matplotlib.axes._subplots.AxesSubplot at 0x296ee2ccba8>
     
     

    Documentation for plot

    [Back to top]

     

     

  • Original post: https://www.cnblogs.com/romannista/p/10659805.html