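  • Pandas CookBook -- 01 Pandas Basics

    Pandas Basics

    These notes follow SeanCheney's translation on Jianshu. I adjusted the formatting and reorganized the sections to suit my own reading and to make later lookups easier.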

    import pandas as pd
    import numpy as np
    

    Set the maximum number of displayed columns and rows

    pd.set_option('display.max_columns', 5, 'display.max_rows', 10)
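
These display settings are global. As a small sketch that is not part of the original notes, `pd.option_context` applies the same options only inside a `with` block. The 3 x 8 DataFrame is made up for illustration.

```python
import pandas as pd

# Hypothetical 3 x 8 frame, used only to demonstrate the display options.
df = pd.DataFrame({f"col_{i}": range(3) for i in range(8)})

before = pd.get_option("display.max_columns")

# Options set here are restored automatically when the block exits.
with pd.option_context("display.max_columns", 5, "display.max_rows", 10):
    inside = pd.get_option("display.max_columns")
    print(df)  # the middle columns are elided with "..."

after = pd.get_option("display.max_columns")
```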
    

    1 DataFrame structure

    movie = pd.read_csv('data/movie.csv')
    
    movie.shape
    
    (4916, 28)
    

    2 Accessing DataFrame components

    2.1 Retrieving the components and their types

    columns = movie.columns
    
    type(columns)
    
    pandas.core.indexes.base.Index
    
    columns
    
    Index(['color', 'director_name', 'num_critic_for_reviews', 'duration',
           'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',
           'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name',
           'movie_title', 'num_voted_users', 'cast_total_facebook_likes',
           'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
           'movie_imdb_link', 'num_user_for_reviews', 'language', 'country',
           'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes',
           'imdb_score', 'aspect_ratio', 'movie_facebook_likes'],
          dtype='object')
    
    index = movie.index
    
    type(index)
    
    pandas.core.indexes.range.RangeIndex
    
    index
    
    RangeIndex(start=0, stop=4916, step=1)
    
    data = movie.values
    
    type(data)
    
    numpy.ndarray
    
    data
    
    array([['Color', 'James Cameron', 723.0, ..., 7.9, 1.78, 33000],
           ['Color', 'Gore Verbinski', 302.0, ..., 7.1, 2.35, 0],
           ['Color', 'Sam Mendes', 602.0, ..., 6.8, 2.35, 85000],
           ...,
           ['Color', 'Benjamin Roberds', 13.0, ..., 6.3, nan, 16],
           ['Color', 'Daniel Hsia', 14.0, ..., 6.3, 2.35, 660],
           ['Color', 'Jon Gunn', 43.0, ..., 6.6, 1.85, 456]], dtype=object)
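
The same three components can be inspected on any small frame. The sketch below uses a made-up two-column DataFrame in place of movie.csv, and `to_numpy()`, which the pandas documentation recommends over `.values`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the movie data.
df = pd.DataFrame(
    {"duration": [178.0, 169.0, 148.0], "imdb_score": [7.9, 7.1, 6.8]}
)

index = df.index        # row labels (a RangeIndex by default)
columns = df.columns    # column labels (an Index)
data = df.to_numpy()    # underlying values as a 2-D NumPy array

print(type(index).__name__, type(columns).__name__, type(data).__name__)
print(data.shape)
```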
    

    2.2 Index types

    Check whether RangeIndex is a subclass of Index

    issubclass(pd.core.indexes.range.RangeIndex,pd.Index)
    
    True
    

    Access the index values. They are returned as a NumPy array, so they can be indexed or sliced

    index.values
    
    array([   0,    1,    2, ..., 4913, 4914, 4915])
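
A short sketch of positional access on an index, using a stand-alone `RangeIndex` rather than the movie data. Indexing and slicing work on the Index object itself as well as on its `.values` array.

```python
import numpy as np
import pandas as pd

index = pd.RangeIndex(start=0, stop=10, step=1)

values = index.values      # NumPy array of the labels
first = index[0]           # scalar access by position
middle = index[2:5]        # slicing returns another Index
picked = index[[1, 3, 5]]  # a list of positions also works

print(type(values).__name__, first, list(middle), list(picked))
```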
    

    3 Understanding data types

    movie.dtypes
    
    color                       object
    director_name               object
    num_critic_for_reviews     float64
    duration                   float64
    director_facebook_likes    float64
                                ...   
    title_year                 float64
    actor_2_facebook_likes     float64
    imdb_score                 float64
    aspect_ratio               float64
    movie_facebook_likes         int64
    Length: 28, dtype: object
    

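    Show the count of each data type (get_dtype_counts() was removed in pandas 1.0; on newer versions use movie.dtypes.value_counts())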

    movie.get_dtype_counts()
    
    float64    13
    int64       3
    object     12
    dtype: int64
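
One way to get the same tally on current pandas is to count the entries of `dtypes` directly. This is a sketch on a made-up frame; the column names are illustrative only.

```python
import pandas as pd

# Hypothetical frame with one integer and two float columns.
df = pd.DataFrame({"votes": [10, 20], "score": [7.9, 7.1], "ratio": [1.78, 2.35]})

# dtypes is a Series of dtype objects, so value_counts() tallies them.
dtype_counts = df.dtypes.value_counts()
print(dtype_counts)

# Plain dict keyed by dtype name, convenient for comparisons.
counts_by_name = {str(dtype): int(n) for dtype, n in dtype_counts.items()}
```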
    

    4 Series structure

    Select a single column as a Series

    movie['director_name']
    
    0           James Cameron
    1          Gore Verbinski
    2              Sam Mendes
    3       Christopher Nolan
    4             Doug Walker
                  ...        
    4911          Scott Smith
    4912                  NaN
    4913     Benjamin Roberds
    4914          Daniel Hsia
    4915             Jon Gunn
    Name: director_name, Length: 4916, dtype: object
    

    A column can also be selected as an attribute

    movie.director_name
    
    0           James Cameron
    1          Gore Verbinski
    2              Sam Mendes
    3       Christopher Nolan
    4             Doug Walker
                  ...        
    4911          Scott Smith
    4912                  NaN
    4913     Benjamin Roberds
    4914          Daniel Hsia
    4915             Jon Gunn
    Name: director_name, Length: 4916, dtype: object
    
    type(movie['director_name'])
    
    pandas.core.series.Series
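
Attribute access is convenient but not always available. The sketch below, on a made-up frame, shows two cases where only the bracket form returns the column: a name containing a space, and a name that collides with an existing DataFrame method.

```python
import pandas as pd

# Hypothetical frame: one column name contains a space, another shadows a method.
df = pd.DataFrame({"director name": ["A", "B"], "count": [1, 2], "score": [7.9, 7.1]})

by_attr = df.score             # fine: valid identifier, no name clash
by_key = df["director name"]   # a space rules out attribute access
method = df.count              # the DataFrame.count method, not the column
column = df["count"]           # brackets always return the column
```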
    

    4.1 Calling Series methods

    Collect all distinct attributes and methods of Series

    s_attr_methods = set(dir(pd.Series))
    
    len(s_attr_methods)
    
    464
    

    Collect all distinct attributes and methods of DataFrame

    df_attr_methods = set(dir(pd.DataFrame))
    
    len(df_attr_methods)
    
    460
    

    Number of attributes and methods the two sets have in common

    len(s_attr_methods & df_attr_methods)
    
    399
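
The same sets can also show what is specific to each class. A brief sketch using set difference; the exact counts depend on the installed pandas version, so none are shown here.

```python
import pandas as pd

s_attr_methods = set(dir(pd.Series))
df_attr_methods = set(dir(pd.DataFrame))

common = s_attr_methods & df_attr_methods
only_series = s_attr_methods - df_attr_methods
only_df = df_attr_methods - s_attr_methods

# Exact counts depend on the installed pandas version.
print(len(common), len(only_series), len(only_df))
print("head" in common, "to_frame" in only_series, "iterrows" in only_df)
```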
    

    4.2 Basic Series methods

    Select the director_name and actor_1_facebook_likes columns

    director = movie['director_name']
    actor_1_fb_likes  = movie['actor_1_facebook_likes']
    

    View the first rows of the Series

    director.head()
    
    0        James Cameron
    1       Gore Verbinski
    2           Sam Mendes
    3    Christopher Nolan
    4          Doug Walker
    Name: director_name, dtype: object
    

    Count how many times each value occurs

    director.value_counts()
    
    Steven Spielberg    26
    Woody Allen         22
    Clint Eastwood      20
    Martin Scorsese     20
    Spike Lee           16
                        ..
    John Duigan          1
    Ray Griggs           1
    Lena Dunham          1
    Dario Argento        1
    Eric Mendelsohn      1
    Name: director_name, Length: 2397, dtype: int64
    

    Relative frequency of each value

    director.value_counts(normalize=True)
    
    Steven Spielberg    0.005401
    Woody Allen         0.004570
    Clint Eastwood      0.004155
    Martin Scorsese     0.004155
    Spike Lee           0.003324
                          ...   
    John Duigan         0.000208
    Ray Griggs          0.000208
    Lena Dunham         0.000208
    Dario Argento       0.000208
    Eric Mendelsohn     0.000208
    Name: director_name, Length: 2397, dtype: float64
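
Note that the frequencies above are relative to the non-null entries only (26 / 4814, not 26 / 4916). A minimal sketch on a made-up Series, showing how `normalize` and `dropna` interact:

```python
import pandas as pd

# Hypothetical director column with one missing entry.
director = pd.Series(["Nolan", "Spielberg", "Nolan", None, "Nolan"])

counts = director.value_counts()                  # missing values are skipped
shares = director.value_counts(normalize=True)    # counts / number of non-null values
with_nan = director.value_counts(dropna=False)    # also tally the missing entry
```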
    

    Length-related attributes

    len(director) 
    
    4916
    
    director.size 
    
    4916
    
    director.shape 
    
    (4916,)
    

    Number of non-null values in director

    director.count() 
    
    4814
    

    Number of missing values (a more direct method exists)

    director.size - director.count()
    
    102
    

    4.3 Series statistics

    Minimum, maximum, mean, median, standard deviation, and sum

    actor_1_fb_likes.min(), actor_1_fb_likes.max()
    
    (0.0, 640000.0)
    
    actor_1_fb_likes.mean(), actor_1_fb_likes.median()
    
    (6494.488490527602, 982.0)
    
    actor_1_fb_likes.std(), actor_1_fb_likes.sum()
    
    (15106.986883848309, 31881444.0)
    

    Summary statistics for a numeric Series

    actor_1_fb_likes.describe()
    
    count      4909.000000
    mean       6494.488491
    std       15106.986884
    min           0.000000
    25%         607.000000
    50%         982.000000
    75%       11000.000000
    max      640000.000000
    Name: actor_1_facebook_likes, dtype: float64
    

    Summary statistics for a string Series

    director.describe()
    
    count                 4814
    unique                2397
    top       Steven Spielberg
    freq                    26
    Name: director_name, dtype: object
    

    Arbitrary quantiles

    actor_1_fb_likes.quantile(.2)
    
    510.0
    
    actor_1_fb_likes.quantile([.1, .2, .3, .4, .5, .6, .7, .8, .9])
    
    0.1      240.0
    0.2      510.0
    0.3      694.0
    0.4      854.0
    0.5      982.0
    0.6     1000.0
    0.7     8000.0
    0.8    13000.0
    0.9    18000.0
    Name: actor_1_facebook_likes, dtype: float64
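
To see how the quantile values come about, here is a sketch on a five-element Series, assuming the default `interpolation='linear'`. A quantile that falls between two data points is interpolated between them.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

median = s.quantile(0.5)               # the middle value
quartiles = s.quantile([0.25, 0.75])   # Series indexed by the requested quantiles
tenth = s.quantile(0.1)                # between 1 and 2: 1 + 0.4 * (2 - 1)
```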
    

    4.4 Handling missing values

    Check whether there are any missing values

    actor_1_fb_likes.hasnans
    
    True
    

    Number of missing values

    actor_1_fb_likes.isnull().sum()
    
    7
    

    Select the missing values

    actor_1_fb_likes[actor_1_fb_likes.isnull()]
    
    4401   NaN
    4418   NaN
    4608   NaN
    4721   NaN
    4822   NaN
    4823   NaN
    4864   NaN
    Name: actor_1_facebook_likes, dtype: float64
    

    Boolean masks for null and non-null values

    actor_1_fb_likes.isnull()
    
    0       False
    1       False
    2       False
    3       False
    4       False
            ...  
    4911    False
    4912    False
    4913    False
    4914    False
    4915    False
    Name: actor_1_facebook_likes, Length: 4916, dtype: bool
    
    bool_sig = actor_1_fb_likes.notnull()
    

    Check whether all the booleans are True

    bool_sig.all()
    
    False
    

    Fill missing values

    actor_1_fb_likes.count()
    
    4909
    
    actor_1_fb_likes_filled = actor_1_fb_likes.fillna(0)
    
    actor_1_fb_likes_filled.count()
    
    4916
    

    Drop missing values

    actor_1_fb_likes_dropped = actor_1_fb_likes.dropna()
    
    actor_1_fb_likes_dropped.size
    
    4909
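
The steps of this section can be condensed into one small sketch. The Series below is made up; the point is that `fillna` and `dropna` return new objects and leave the original untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical likes column with two missing entries.
likes = pd.Series([1000.0, np.nan, 40000.0, np.nan, 11000.0])

n_missing = likes.isna().sum()   # True counts as 1
n_present = likes.count()        # non-null values only
filled = likes.fillna(0)         # new Series; likes itself is unchanged
dropped = likes.dropna()         # shorter Series, original labels kept
```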
    

    4.5 Using operators on a Series

    imdb_score = movie['imdb_score']
    

    Arithmetic operators

    imdb_score + 1
    
    0       8.9
    1       8.1
    2       7.8
    3       9.5
    4       8.1
           ... 
    4911    8.7
    4912    8.5
    4913    7.3
    4914    7.3
    4915    7.6
    Name: imdb_score, Length: 4916, dtype: float64
    

    The equivalent method call

    imdb_score.add(1)        
    
    0       8.9
    1       8.1
    2       7.8
    3       9.5
    4       8.1
           ... 
    4911    8.7
    4912    8.5
    4913    7.3
    4914    7.3
    4915    7.6
    Name: imdb_score, Length: 4916, dtype: float64
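
The method form gives the same result as the operator, but it also accepts extra parameters. A sketch with two made-up Series, using `fill_value` to treat a missing operand as 0:

```python
import numpy as np
import pandas as pd

a = pd.Series([1.0, np.nan, 3.0])
b = pd.Series([10.0, 20.0, np.nan])

plain = a + b                     # NaN wherever either side is missing
filled = a.add(b, fill_value=0)   # a missing side is treated as 0
```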
    

    4.6 Type conversion

    imdb_score.dtype
    
    dtype('float64')
    
    imdb_score = imdb_score.astype(int)
    
    imdb_score.dtype
    
    dtype('int64')
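
Two things are worth keeping in mind with `astype(int)`: the fractional part is dropped rather than rounded, and a column containing NaN cannot be cast to a plain integer dtype. A sketch with made-up scores:

```python
import numpy as np
import pandas as pd

scores = pd.Series([7.9, 6.3, 8.6])
as_int = scores.astype(int)            # fractional part is dropped, not rounded
rounded = scores.round().astype(int)   # round first if that is what you want

# A float column containing NaN cannot be cast to a plain integer dtype.
with_nan = pd.Series([7.9, np.nan])
try:
    with_nan.astype(int)
    cast_failed = False
except ValueError:
    cast_failed = True
```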
    

    5 Making the DataFrame index meaningful

    movie.shape
    
    (4916, 28)
    
    movie.tail()
    
          color     director_name  ...  aspect_ratio  movie_facebook_likes
    4911  Color       Scott Smith  ...           NaN                    84
    4912  Color               NaN  ...         16.00                 32000
    4913  Color  Benjamin Roberds  ...           NaN                    16
    4914  Color       Daniel Hsia  ...          2.35                   660
    4915  Color          Jon Gunn  ...          1.85                   456

    5 rows × 28 columns

    5.1 Naming the row and column indexes

    movie.index.name = 'row_index'
    
    movie.columns.name = 'col_index'
    
    movie.tail()
    
    col_index  color     director_name  ...  aspect_ratio  movie_facebook_likes
    row_index
    4911       Color       Scott Smith  ...           NaN                    84
    4912       Color               NaN  ...         16.00                 32000
    4913       Color  Benjamin Roberds  ...           NaN                    16
    4914       Color       Daniel Hsia  ...          2.35                   660
    4915       Color          Jon Gunn  ...          1.85                   456

    5 rows × 28 columns

    5.2 Resetting the index

    Use one or more existing DataFrame columns as the index

    movie2 = movie.set_index('movie_title')
    
    movie2.tail()
    
    col_index                color     director_name  ...  aspect_ratio  movie_facebook_likes
    movie_title
    Signed Sealed Delivered  Color       Scott Smith  ...           NaN                    84
    The Following            Color               NaN  ...         16.00                 32000
    A Plague So Pleasant     Color  Benjamin Roberds  ...           NaN                    16
    Shanghai Calling         Color       Daniel Hsia  ...          2.35                   660
    My Date with Drew        Color          Jon Gunn  ...          1.85                   456

    5 rows × 27 columns

    Another way: set the index while reading the file

    movie = pd.read_csv('data/movie.csv',index_col = 'movie_title')
    

    Restore the default integer index

    movie2.reset_index().tail()
    
    col_index              movie_title  color  ...  aspect_ratio  movie_facebook_likes
    4911       Signed Sealed Delivered  Color  ...           NaN                    84
    4912                 The Following  Color  ...         16.00                 32000
    4913          A Plague So Pleasant  Color  ...           NaN                    16
    4914              Shanghai Calling  Color  ...          2.35                   660
    4915             My Date with Drew  Color  ...          1.85                   456

    5 rows × 28 columns
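
The round trip between `set_index` and `reset_index` can be checked on a tiny made-up table. Both methods return new DataFrames, and `reset_index(drop=True)` discards the old index instead of turning it back into a column.

```python
import pandas as pd

# Hypothetical mini version of the movie table.
df = pd.DataFrame(
    {"movie_title": ["Avatar", "Spectre"], "imdb_score": [7.9, 6.8]}
)

indexed = df.set_index("movie_title")      # column becomes the index, df is unchanged
restored = indexed.reset_index()           # index goes back to being the first column
dropped = indexed.reset_index(drop=True)   # discard the old index instead of keeping it
```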

    6 Renaming row and column labels

    Rename with rename()

    idx_rename = {'Avatar':'Ratava', 'Spectre': 'Ertceps'} 
    
    col_rename = {'director_name':'Director Name','num_critic_for_reviews': 'Critical Reviews'} 
    
    movie.rename(index=idx_rename, columns=col_rename).head()
    
                                                color      Director Name  ...  aspect_ratio  movie_facebook_likes
    movie_title
    Ratava                                      Color      James Cameron  ...          1.78                 33000
    Pirates of the Caribbean: At World's End    Color     Gore Verbinski  ...          2.35                     0
    Ertceps                                     Color         Sam Mendes  ...          2.35                 85000
    The Dark Knight Rises                       Color  Christopher Nolan  ...          2.35                164000
    Star Wars: Episode VII - The Force Awakens    NaN        Doug Walker  ...           NaN                     0

    5 rows × 27 columns
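
A small sketch of how `rename()` behaves, on a made-up two-row frame: it returns a new DataFrame, leaves the original labels alone, and by default silently skips mapping keys that do not match any label.

```python
import pandas as pd

df = pd.DataFrame(
    {"director_name": ["James Cameron", "Sam Mendes"]},
    index=["Avatar", "Spectre"],
)

renamed = df.rename(
    index={"Avatar": "Ratava", "No Such Movie": "ignored"},  # unknown keys are skipped
    columns={"director_name": "Director Name"},
)
```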

    Renaming by assigning modified label lists

    index = movie.index
    columns = movie.columns
    
    index_list = index.tolist()
    column_list = columns.tolist()
    
    index_list[0] = 'Ratava'
    index_list[2] = 'Ertceps'
    column_list[1] = 'Director Name'
    column_list[2] = 'Critical Reviews'
    
    movie.index = index_list
    movie.columns = column_list
    
    movie.head()
    
                                                color      Director Name  ...  aspect_ratio  movie_facebook_likes
    Ratava                                      Color      James Cameron  ...          1.78                 33000
    Pirates of the Caribbean: At World's End    Color     Gore Verbinski  ...          2.35                     0
    Ertceps                                     Color         Sam Mendes  ...          2.35                 85000
    The Dark Knight Rises                       Color  Christopher Nolan  ...          2.35                164000
    Star Wars: Episode VII - The Force Awakens    NaN        Doug Walker  ...           NaN                     0

    5 rows × 27 columns

    7 Creating and deleting columns

    Add a new column with [column name]

    movie = pd.read_csv('data/movie.csv')
    
    movie['has_seen'] = 0
    
    movie['actor_director_facebook_likes'] = (movie['actor_1_facebook_likes'] + movie['actor_2_facebook_likes'])
    
    movie.shape,movie['actor_director_facebook_likes'].shape
    
    ((4916, 30), (4916,))
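
Bracket assignment always appends the new column at the end. If the position matters, `DataFrame.insert` is one alternative; the sketch below uses a made-up frame and modifies it in place.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

df["total"] = df["a"] + df["b"]   # bracket assignment appends at the end
df.insert(1, "has_seen", 0)       # insert() places the column at position 1, in place
```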
    

    Delete rows or columns

    movie.drop(['actor_director_facebook_likes','actor_1_facebook_likes'],axis=1)
    
          color      director_name  ...  movie_facebook_likes  has_seen
    0     Color      James Cameron  ...                 33000         0
    1     Color     Gore Verbinski  ...                     0         0
    2     Color         Sam Mendes  ...                 85000         0
    3     Color  Christopher Nolan  ...                164000         0
    4       NaN        Doug Walker  ...                     0         0
    ...     ...                ...  ...                   ...       ...
    4911  Color        Scott Smith  ...                    84         0
    4912  Color                NaN  ...                 32000         0
    4913  Color   Benjamin Roberds  ...                    16         0
    4914  Color        Daniel Hsia  ...                   660         0
    4915  Color           Jon Gunn  ...                   456         0

    4916 rows × 28 columns

    movie.drop([0,2])
    
          color      director_name  ...  has_seen  actor_director_facebook_likes
    1     Color     Gore Verbinski  ...         0                        45000.0
    3     Color  Christopher Nolan  ...         0                        50000.0
    4       NaN        Doug Walker  ...         0                          143.0
    5     Color     Andrew Stanton  ...         0                         1272.0
    6     Color          Sam Raimi  ...         0                        35000.0
    ...     ...                ...  ...       ...                            ...
    4911  Color        Scott Smith  ...         0                         1107.0
    4912  Color                NaN  ...         0                         1434.0
    4913  Color   Benjamin Roberds  ...         0                            0.0
    4914  Color        Daniel Hsia  ...         0                         1665.0
    4915  Color           Jon Gunn  ...         0                          109.0

    4914 rows × 30 columns
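
Both `drop` calls above return a new DataFrame and leave `movie` as it was, which is why the second output still has 30 columns. A sketch on a made-up frame, using the `columns=` and `index=` keywords as an alternative to `axis`:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

fewer_cols = df.drop(columns=["b"])   # same as df.drop(["b"], axis=1)
fewer_rows = df.drop(index=[0, 2])    # same as df.drop([0, 2])
```

To keep the result, assign it back to a name, for example `df = df.drop(columns=["b"])`.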

  • Original article: https://www.cnblogs.com/shiyushiyu/p/9712998.html