zoukankan      html  css  js  c++  java
  • Pandas CookBook -- 02DataFrame基础操作

    Pandas基础操作

    简书大神SeanCheney的译作,我作了些格式调整和文章目录结构的变化,更适合自己阅读,以后翻阅是更加方便自己查找吧

    import pandas as pd
    import numpy as np
    

    设定最大列数和最大行数

    pd.set_option('max_columns',5 , 'max_rows', 5)
    

    1 选取多个DataFrame列

    1.1 用列表选取多个列

    movie = pd.read_csv('data/movie.csv')
    
    cols =['actor_1_name', 'actor_2_name', 'actor_3_name', 'director_name']
    
    movie_actor_director = movie[cols]
    
    movie_actor_director
    
    actor_1_name actor_2_name actor_3_name director_name
    0 CCH Pounder Joel David Moore Wes Studi James Cameron
    1 Johnny Depp Orlando Bloom Jack Davenport Gore Verbinski
    ... ... ... ... ...
    4914 Alan Ruck Daniel Henney Eliza Coupe Daniel Hsia
    4915 John August Brian Herzlinger Jon Gunn Jon Gunn

    4916 rows × 4 columns

    1.2 使用select_dtypes选取类型

    select_dtypes(include=None, exclude=None)

    • To select all numeric types, use np.number or 'number'
    • To select strings you must use the object dtype, but note that this will return all object dtype columns,See the numpy dtype hierarchy
    • To select datetimes, use np.datetime64, 'datetime' or 'datetime64'
    • To select timedeltas, use np.timedelta64, 'timedelta' or 'timedelta64'
    • To select Pandas categorical dtypes, use 'category'
    movie.shape
    
    (4916, 28)
    

    1.2.1 选取整数列

    movie.select_dtypes(include=['int']).head()
    
    num_voted_users cast_total_facebook_likes movie_facebook_likes
    0 886204 4834 33000
    1 471220 48350 0
    2 275868 11700 85000
    3 1144337 106759 164000
    4 8 143 0

    1.2.2 选取非整数列

    movie.select_dtypes(exclude=['int']).head()
    
    color director_name ... imdb_score aspect_ratio
    0 Color James Cameron ... 7.9 1.78
    1 Color Gore Verbinski ... 7.1 2.35
    2 Color Sam Mendes ... 6.8 2.35
    3 Color Christopher Nolan ... 8.5 2.35
    4 NaN Doug Walker ... 7.1 NaN

    5 rows × 25 columns

    1.2.3 通过filter函数过滤选取多列

    filter(items=None, like=None, regex=None, axis=None)

    • items : list-like
      • List of info axis to restrict to (must not all be present)
      • 传递个列名或行名列表
    • like : string
      • Keep info axis where “arg in col == True”
      • 类似Python里面字符串的find()函数,col.find(arg)
    • regex : string (regular expression)
      • Keep info axis with re.search(regex, col) == True

    通过filter()函数过滤选取多列

    movie.filter(like='facebook').head()
    
    director_facebook_likes actor_3_facebook_likes ... actor_2_facebook_likes movie_facebook_likes
    0 0.0 855.0 ... 936.0 33000
    1 563.0 1000.0 ... 5000.0 0
    2 0.0 161.0 ... 393.0 85000
    3 22000.0 23000.0 ... 23000.0 164000
    4 131.0 NaN ... 12.0 0

    5 rows × 6 columns

    通过正则表达式选取多列

    movie.filter(regex='d').head()
    
    actor_3_facebook_likes actor_2_name ... actor_3_name actor_2_facebook_likes
    0 855.0 Joel David Moore ... Wes Studi 936.0
    1 1000.0 Orlando Bloom ... Jack Davenport 5000.0
    2 161.0 Rory Kinnear ... Stephanie Sigman 393.0
    3 23000.0 Christian Bale ... Joseph Gordon-Levitt 23000.0
    4 NaN Rob Walker ... NaN 12.0

    5 rows × 6 columns

    filter()函数,传递列表到参数items,选取多列

    movie.filter(items=['actor_1_name', 'actor_3_name']).head()
    
    actor_1_name actor_3_name
    0 CCH Pounder Wes Studi
    1 Johnny Depp Jack Davenport
    2 Christoph Waltz Stephanie Sigman
    3 Tom Hardy Joseph Gordon-Levitt
    4 Doug Walker NaN

    2 DataFrame上操作

    2.1 基本方法

    数据的个数 数据集的维度 数据集的长度

     movie.shape,movie.size,movie.ndim
    
    ((4916, 28), 137648, 2)
    

    各个列的非空值的个数

    movie.count()
    
    color                   4897
    director_name           4814
                            ... 
    aspect_ratio            4590
    movie_facebook_likes    4916
    Length: 28, dtype: int64
    

    2.2 统计信息

    movie.shape
    
    (4916, 28)
    

    2.2.1 最大 最小值

    2.2.1.1 数值类型

    # min max quantile
    movie_min = movie.min()
    movie_min.name = '最小值'
    movie_min
    
    num_critic_for_reviews    1.00
    duration                  7.00
                              ... 
    aspect_ratio              1.18
    movie_facebook_likes      0.00
    Name: 最小值, Length: 16, dtype: float64
    

    计算是默认会跳过缺失值的,可设置skipna=False使其包含缺失,但这样不具有意义

    movie.min(skipna=False)
    
    num_critic_for_reviews    NaN
    duration                  NaN
                             ... 
    aspect_ratio              NaN
    movie_facebook_likes      0.0
    Length: 16, dtype: float64
    

    2.2.1.2 字符串类型

    当字符串类型的列包含缺失值时,聚合方法min、max、sum,不会返回任何值。

    movie[['color', 'movie_title', 'color']].max()
    
    Series([], dtype: float64)
    

    要让pandas强行返回每列的值,必须填入缺失值。下面填入的是空字符串

    movie[['color', 'movie_title', 'color']].fillna('').max()
    
    color             Color
    movie_title    Æon Flux
    color             Color
    dtype: object
    

    2.2.2 统计信息

    2.2.2.1 数值型

    使用percentiles参数指定分位数

    movie.describe(percentiles=[.01, .3, .99])
    
    num_critic_for_reviews duration ... aspect_ratio movie_facebook_likes
    count 4867.000000 4901.000000 ... 4590.000000 4916.000000
    mean 137.988905 107.090798 ... 2.222349 7348.294142
    ... ... ... ... ... ...
    99% 546.680000 189.000000 ... 4.000000 93850.000000
    max 813.000000 511.000000 ... 16.000000 349000.000000

    9 rows × 16 columns

    2.2.2.2 字符串型

    movie.select_dtypes(include='object').describe()
    
    color director_name ... country content_rating
    count 4897 4814 ... 4911 4616
    unique 2 2397 ... 65 18
    top Color Steven Spielberg ... USA R
    freq 4693 26 ... 3710 2067

    4 rows × 12 columns

    2.3 方法的组合

    使用isnull方法将每个值转变为布尔值

    movie.isnull().head()
    
    color director_name ... aspect_ratio movie_facebook_likes
    0 False False ... False False
    1 False False ... False False
    2 False False ... False False
    3 False False ... False False
    4 True False ... True False

    5 rows × 28 columns

    sum统计布尔值,返回的是Series

    movie.isnull().sum().head()
    
    color                       19
    director_name              102
    num_critic_for_reviews      49
    duration                    15
    director_facebook_likes    102
    dtype: int64
    

    对这个Series再使用sum,返回整个DataFrame的缺失值的个数,返回值是个标量

    movie.isnull().sum().sum()
    
    2654
    

    判断整个DataFrame有没有缺失值,方法是连着使用两个any

    movie.isnull().any().any()
    
    True
    

    2.4 运算符

    行索引名设为INSTNM,用UGDS_过滤出本科生的种族比例

    college = pd.read_csv('data/college.csv', index_col='INSTNM')
    
    college_ugds_ = college.filter(like='UGDS_')
    
    college_ugds_
    
    UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN
    INSTNM
    Alabama A & M University 0.0333 0.9353 ... 0.0059 0.0138
    University of Alabama at Birmingham 0.5922 0.2600 ... 0.0179 0.0100
    ... ... ... ... ... ...
    Bay Area Medical Academy - San Jose Satellite Location NaN NaN ... NaN NaN
    Excel Learning Center-San Antonio South NaN NaN ... NaN NaN

    7535 rows × 9 columns

    college_ugds_的数值类型都是float,可以进行整数运算

    college_ugds_.dtypes
    
    UGDS_WHITE    float64
    UGDS_BLACK    float64
                   ...   
    UGDS_NRA      float64
    UGDS_UNKN     float64
    Length: 9, dtype: object
    

    2.4.1 加减乘除

    college_ugds_.head() + .00501
    
    UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN
    INSTNM
    Alabama A & M University 0.03831 0.94031 ... 0.01091 0.01881
    University of Alabama at Birmingham 0.59721 0.26501 ... 0.02291 0.01501
    Amridge University 0.30401 0.42421 ... 0.00501 0.27651
    University of Alabama in Huntsville 0.70381 0.13051 ... 0.03821 0.04001
    Alabama State University 0.02081 0.92581 ... 0.02931 0.01871

    5 rows × 9 columns

    2.4.2 计算样例数据的百分比

    2.4.2.1 方式一

    college_ugds_op_round = (college_ugds_ + .00501) // .01 / 100
    
    college_ugds_op_round.head()
    
    UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN
    INSTNM
    Alabama A & M University 0.03 0.94 ... 0.01 0.01
    University of Alabama at Birmingham 0.59 0.26 ... 0.02 0.01
    Amridge University 0.30 0.42 ... 0.00 0.27
    University of Alabama in Huntsville 0.70 0.13 ... 0.03 0.04
    Alabama State University 0.02 0.92 ... 0.02 0.01

    5 rows × 9 columns

    2.4.2.2 方式二

    college_ugds_round = (college_ugds_ + .00001).round(2)
    
    college_ugds_round.head()
    
    UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN
    INSTNM
    Alabama A & M University 0.03 0.94 ... 0.01 0.01
    University of Alabama at Birmingham 0.59 0.26 ... 0.02 0.01
    Amridge University 0.30 0.42 ... 0.00 0.27
    University of Alabama in Huntsville 0.70 0.13 ... 0.03 0.04
    Alabama State University 0.02 0.92 ... 0.02 0.01

    5 rows × 9 columns

    2.4.2.3 方式三

    college_ugds_op_round_methods = college_ugds_.add(.00501).floordiv(.01).div(100)
    
    college_ugds_op_round_methods.head()
    
    UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN
    INSTNM
    Alabama A & M University 0.03 0.94 ... 0.01 0.01
    University of Alabama at Birmingham 0.59 0.26 ... 0.02 0.01
    Amridge University 0.30 0.42 ... 0.00 0.27
    University of Alabama in Huntsville 0.70 0.13 ... 0.03 0.04
    Alabama State University 0.02 0.92 ... 0.02 0.01

    5 rows × 9 columns

    3 比较缺失值

    Pandas使用NumPy NaN(np.nan)对象表示缺失值。这是一个不等于自身的特殊对象:

    np.nan == np.nan
    
    False
    

    所有和np.nan的比较都返回False,除了不等于:

    5 > np.nan
    
    False
    
    5 != np.nan
    
    True
    

    无法通过直接比较比较,含有缺失值的df是否一致

    movie_equal = movie == movie
    
    movie_equal.all().all()
    
    False
    
    movie_equal.size - movie_equal.sum().sum()
    
    2654
    
    movie.isnull().sum().sum()
    
    2654
    

    比较两个DataFrame最直接的方法是使用equals()方法

    from pandas.testing import assert_frame_equal
    
    assert_frame_equal(movie, movie)
    
    天下风云出我辈,一入江湖岁月催
  • 相关阅读:
    大数据小项目之电视收视率企业项目10
    大数据小项目之电视收视率企业项目09--hive环境搭建
    spark入门: wordcount-java
    转发一个关于下载qq无损音乐的博客
    大数据小项目之电视收视率企业项目08--》MapReduce编写之Wordcount
    大数据小项目之电视收视率企业项目07
    大数据小项目之电视收视率企业项目06
    《python语言程序设计》_第6章_函数
    awk 相关
    [转]四篇应该仔细读的关于文本分析的tutorial类文章
  • 原文地址:https://www.cnblogs.com/shiyushiyu/p/9734621.html
Copyright © 2011-2022 走看看