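These notes are based on a translation by SeanCheney on Jianshu. I adjusted the formatting and reorganized the sections to suit my own reading and to make things easier to look up later.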
import pandas as pd
import numpy as np
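Set the maximum number of columns and rows to display (full option names are used because short patterns such as 'max_columns' can be ambiguous on newer pandas versions)
pd.set_option('display.max_columns', 5, 'display.max_rows', 10)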
1 Structure of a DataFrame
movie = pd.read_csv('data/movie.csv')
movie.shape
(4916, 28)
2 Accessing the components of a DataFrame
2.1 Retrieving the components and their types
columns = movie.columns
type(columns)
pandas.core.indexes.base.Index
columns
Index(['color', 'director_name', 'num_critic_for_reviews', 'duration',
'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',
'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name',
'movie_title', 'num_voted_users', 'cast_total_facebook_likes',
'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
'movie_imdb_link', 'num_user_for_reviews', 'language', 'country',
'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes',
'imdb_score', 'aspect_ratio', 'movie_facebook_likes'],
dtype='object')
index = movie.index
type(index)
pandas.core.indexes.range.RangeIndex
index
RangeIndex(start=0, stop=4916, step=1)
data = movie.values
type(data)
numpy.ndarray
data
array([['Color', 'James Cameron', 723.0, ..., 7.9, 1.78, 33000],
['Color', 'Gore Verbinski', 302.0, ..., 7.1, 2.35, 0],
['Color', 'Sam Mendes', 602.0, ..., 6.8, 2.35, 85000],
...,
['Color', 'Benjamin Roberds', 13.0, ..., 6.3, nan, 16],
['Color', 'Daniel Hsia', 14.0, ..., 6.3, 2.35, 660],
['Color', 'Jon Gunn', 43.0, ..., 6.6, 1.85, 456]], dtype=object)
2.2 Index types
Check whether RangeIndex is a subclass of Index
issubclass(pd.core.indexes.range.RangeIndex,pd.Index)
True
Access the values of the index; index.values is a NumPy array (not a list), so it can be indexed or sliced
index.values
array([ 0, 1, 2, ..., 4913, 4914, 4915])
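The three components above can be reproduced without movie.csv. The following is a minimal sketch on a made-up three-row frame (the name toy and its columns are assumptions for illustration only); it uses to_numpy(), which the pandas documentation recommends over .values in newer versions.

```python
import pandas as pd

# Hypothetical toy data, not taken from movie.csv
toy = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'score': [7.9, 7.1, 6.8],
    'likes': [33000, 0, 85000],
})

index = toy.index        # RangeIndex(start=0, stop=3, step=1)
columns = toy.columns    # Index holding the column labels
data = toy.to_numpy()    # 2-D ndarray; mixed column types give dtype=object

# Index objects support positional indexing and slicing
first_label = index[0]
last_two_columns = list(columns[-2:])
```

Because the columns mix strings and numbers, the resulting array falls back to the object dtype, just as movie.values does above.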
3 Understanding data types
movie.dtypes
color object
director_name object
num_critic_for_reviews float64
duration float64
director_facebook_likes float64
...
title_year float64
actor_2_facebook_likes float64
imdb_score float64
aspect_ratio float64
movie_facebook_likes int64
Length: 28, dtype: object
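Count the columns of each dtype (get_dtype_counts() was deprecated in pandas 0.25 and removed in 1.0; on newer versions use movie.dtypes.value_counts() instead)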
movie.get_dtype_counts()
float64 13
int64 3
object 12
dtype: int64
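On pandas versions without get_dtype_counts(), the same tally can be obtained from the dtypes Series. This is a minimal sketch on a made-up numeric frame (toy and its column names are assumptions, not part of movie.csv).

```python
import pandas as pd

# Hypothetical toy data with two float columns and one integer column
toy = pd.DataFrame({
    'score': [7.9, 7.1],
    'ratio': [1.78, 2.35],
    'likes': [33000, 0],
})

# dtypes is a Series of dtype objects, so value_counts() tallies them
dtype_counts = toy.dtypes.value_counts()

# Convert to a plain dict keyed by dtype name for easy lookup
counts = {str(dtype): int(n) for dtype, n in dtype_counts.items()}
```

Note that value_counts() sorts by count, so the row order can differ from the alphabetical order shown above.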
4 Series structure
Select a single column as a Series
movie['director_name']
0 James Cameron
1 Gore Verbinski
2 Sam Mendes
3 Christopher Nolan
4 Doug Walker
...
4911 Scott Smith
4912 NaN
4913 Benjamin Roberds
4914 Daniel Hsia
4915 Jon Gunn
Name: director_name, Length: 4916, dtype: object
A column can also be selected as an attribute
movie.director_name
0 James Cameron
1 Gore Verbinski
2 Sam Mendes
3 Christopher Nolan
4 Doug Walker
...
4911 Scott Smith
4912 NaN
4913 Benjamin Roberds
4914 Daniel Hsia
4915 Jon Gunn
Name: director_name, Length: 4916, dtype: object
type(movie['director_name'])
pandas.core.series.Series
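Attribute access is convenient but not always equivalent to bracket access. The sketch below, on a made-up frame (toy and its column names are assumptions), shows two cases where only the bracket form returns the column: a name containing a space, and a name that collides with an existing DataFrame method.

```python
import pandas as pd

# Hypothetical toy data: one column name has a space, the other shadows a method
toy = pd.DataFrame({'director name': ['X', 'Y'], 'count': [1, 2]})

by_key = toy['count']          # always the column
by_attr = toy.count            # the DataFrame.count method, not the column
spaced = toy['director name']  # no attribute form exists for this name
```

For this reason the bracket form is the safer default in scripts, while the attribute form is mostly a convenience for interactive work.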
4.1 Calling Series methods
Collect all distinct attributes and methods of Series (the exact counts depend on the pandas version)
s_attr_methods = set(dir(pd.Series))
len(s_attr_methods)
464
Collect all distinct attributes and methods of DataFrame
df_attr_methods = set(dir(pd.DataFrame))
len(df_attr_methods)
460
How many attributes and methods the two sets have in common
len(s_attr_methods & df_attr_methods)
399
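The same sets can also show what is not shared. A small sketch using set difference follows; the variable names series_only and frame_only are introduced here for illustration.

```python
import pandas as pd

s_attr_methods = set(dir(pd.Series))
df_attr_methods = set(dir(pd.DataFrame))

# Names available on Series but not on DataFrame, and vice versa
series_only = s_attr_methods - df_attr_methods
frame_only = df_attr_methods - s_attr_methods
shared = s_attr_methods & df_attr_methods
```

For example, the str accessor and to_frame exist only on Series, columns exists only on DataFrame, and head is shared by both.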
4.2 Basic Series methods
Select the director_name and actor_1_facebook_likes columns
director = movie['director_name']
actor_1_fb_likes = movie['actor_1_facebook_likes']
View the first rows of the Series
director.head()
0 James Cameron
1 Gore Verbinski
2 Sam Mendes
3 Christopher Nolan
4 Doug Walker
Name: director_name, dtype: object
Count how many times each value occurs
director.value_counts()
Steven Spielberg 26
Woody Allen 22
Clint Eastwood 20
Martin Scorsese 20
Spike Lee 16
..
John Duigan 1
Ray Griggs 1
Lena Dunham 1
Dario Argento 1
Eric Mendelsohn 1
Name: director_name, Length: 2397, dtype: int64
Relative frequency of each value
director.value_counts(normalize=True)
Steven Spielberg 0.005401
Woody Allen 0.004570
Clint Eastwood 0.004155
Martin Scorsese 0.004155
Spike Lee 0.003324
...
John Duigan 0.000208
Ray Griggs 0.000208
Lena Dunham 0.000208
Dario Argento 0.000208
Eric Mendelsohn 0.000208
Name: director_name, Length: 2397, dtype: float64
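In the output above, 0.005401 is 26 divided by the 4814 non-null directors rather than by all 4916 rows, because value_counts() drops NaN by default. The sketch below illustrates this on a made-up Series (names is an assumption for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value
names = pd.Series(['Ann', 'Ann', 'Bob', np.nan])

# Default: NaN is excluded, so frequencies are relative to the 3 non-null values
freq = names.value_counts(normalize=True)

# dropna=False keeps NaN as its own category, relative to all 4 values
freq_with_nan = names.value_counts(normalize=True, dropna=False)
```

With dropna=False, 'Ann' falls from 2/3 to 0.5 because the missing entry now counts toward the total.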
Length-related attributes
len(director)
4916
director.size
4916
director.shape
(4916,)
Number of non-null values in director
director.count()
4814
Number of null values (a more direct way is shown in 4.4)
director.size - director.count()
102
4.3 Series统计信息
Minimum, maximum, mean, median, standard deviation, and sum
actor_1_fb_likes.min(), actor_1_fb_likes.max()
(0.0, 640000.0)
actor_1_fb_likes.mean(), actor_1_fb_likes.median()
(6494.488490527602, 982.0)
actor_1_fb_likes.std(), actor_1_fb_likes.sum()
(15106.986883848309, 31881444.0)
Summary statistics for numeric data
actor_1_fb_likes.describe()
count 4909.000000
mean 6494.488491
std 15106.986884
min 0.000000
25% 607.000000
50% 982.000000
75% 11000.000000
max 640000.000000
Name: actor_1_facebook_likes, dtype: float64
Summary statistics for string data
director.describe()
count 4814
unique 2397
top Steven Spielberg
freq 26
Name: director_name, dtype: object
Arbitrary quantiles
actor_1_fb_likes.quantile(.2)
510.0
actor_1_fb_likes.quantile([.1, .2, .3, .4, .5, .6, .7, .8, .9])
0.1 240.0
0.2 510.0
0.3 694.0
0.4 854.0
0.5 982.0
0.6 1000.0
0.7 8000.0
0.8 13000.0
0.9 18000.0
Name: actor_1_facebook_likes, dtype: float64
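quantile() interpolates linearly between data points by default, which is why a quantile need not be a value that occurs in the data. A minimal sketch on a made-up Series (s is an assumption for illustration), also showing the interpolation parameter:

```python
import pandas as pd

# Hypothetical toy data: five evenly spaced values
s = pd.Series([0, 10, 20, 30, 40])

median = s.quantile(0.5)                               # 20.0
q10_linear = s.quantile(0.1)                           # interpolated between 0 and 10
q10_lower = s.quantile(0.1, interpolation='lower')     # nearest data point below
q10_higher = s.quantile(0.1, interpolation='higher')   # nearest data point above

# A list of probabilities returns a Series indexed by those probabilities
quartiles = s.quantile([0.25, 0.75])
```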
4.4 Handling missing values
Check whether there are any missing values
actor_1_fb_likes.hasnans
True
Number of missing values
actor_1_fb_likes.isnull().sum()
7
Select the missing values
actor_1_fb_likes[actor_1_fb_likes.isnull()]
4401 NaN
4418 NaN
4608 NaN
4721 NaN
4822 NaN
4823 NaN
4864 NaN
Name: actor_1_facebook_likes, dtype: float64
Boolean mask of missing values; notnull() gives the complement, i.e. the non-null mask
actor_1_fb_likes.isnull()
0 False
1 False
2 False
3 False
4 False
...
4911 False
4912 False
4913 False
4914 False
4915 False
Name: actor_1_facebook_likes, Length: 4916, dtype: bool
bool_sig = actor_1_fb_likes.notnull()
Check whether all the booleans are True
bool_sig.all()
False
Fill missing values
actor_1_fb_likes.count()
4909
actor_1_fb_likes_filled = actor_1_fb_likes.fillna(0)
actor_1_fb_likes_filled.count()
4916
Drop missing values
actor_1_fb_likes_dropped = actor_1_fb_likes.dropna()
actor_1_fb_likes_dropped.size
4909
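Filling with 0 is only one option. As a hedged sketch, the made-up Series below (likes is an assumption for illustration) is filled with its own median instead; it also shows that fillna() and dropna() return new objects and leave the original untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing values
likes = pd.Series([1000.0, np.nan, 500.0, np.nan])

n_missing = int(likes.isna().sum())            # isna() is an alias of isnull()

# median() skips NaN, so it is computed from 1000 and 500 only
filled_median = likes.fillna(likes.median())

# dropna() keeps the original labels of the surviving rows
dropped = likes.dropna()
```

Whether 0, the median, or dropping rows is appropriate depends on what a missing value means in the data; none of these is the right answer in general.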
4.5 Using operators on a Series
imdb_score = movie['imdb_score']
Arithmetic operators (addition, subtraction, multiplication, division)
imdb_score + 1
0 8.9
1 8.1
2 7.8
3 9.5
4 8.1
...
4911 8.7
4912 8.5
4913 7.3
4914 7.3
4915 7.6
Name: imdb_score, Length: 4916, dtype: float64
The equivalent method call
imdb_score.add(1)
0 8.9
1 8.1
2 7.8
3 9.5
4 8.1
...
4911 8.7
4912 8.5
4913 7.3
4914 7.3
4915 7.6
Name: imdb_score, Length: 4916, dtype: float64
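For a scalar, + and add() give the same result. The method form becomes useful when two Series are combined, because it accepts a fill_value for labels present in only one of them. A minimal sketch with two made-up Series (a and b are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy data with partly overlapping labels
a = pd.Series([1.0, 2.0], index=['x', 'y'])
b = pd.Series([10.0, 20.0], index=['y', 'z'])

# The operator aligns on labels; labels missing on either side become NaN
plain_sum = a + b

# The method treats a label missing on one side as fill_value instead
filled_sum = a.add(b, fill_value=0)
```

Here plain_sum is NaN for 'x' and 'z', while filled_sum keeps the single available value for those labels.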
4.6 Type conversion
imdb_score.dtype
dtype('float64')
imdb_score = imdb_score.astype(int)
imdb_score.dtype
dtype('int64')
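Two details of astype(int) are worth keeping in mind: it truncates toward zero rather than rounding, and it raises an error if the Series contains NaN. The sketch below uses made-up data (scores and with_nan are assumptions for illustration) and shows the nullable 'Int64' dtype as one possible way around the second point.

```python
import numpy as np
import pandas as pd

# Hypothetical toy scores
scores = pd.Series([7.9, 7.1, 6.8])

truncated = scores.astype(int)          # drops the fractional part
rounded = scores.round().astype(int)    # rounds first, then converts

# A float Series containing NaN cannot be cast to a plain integer dtype
with_nan = pd.Series([1.0, np.nan])
try:
    with_nan.astype(int)
    cast_failed = False
except ValueError:
    cast_failed = True

# The nullable integer dtype (capital "I") keeps the missing value as <NA>
nullable = with_nan.astype('Int64')
```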
5 Making the DataFrame index meaningful
movie.shape
(4916, 28)
movie.tail()
| | color | director_name | ... | aspect_ratio | movie_facebook_likes |
---|---|---|---|---|---|
4911 | Color | Scott Smith | ... | NaN | 84 |
4912 | Color | NaN | ... | 16.00 | 32000 |
4913 | Color | Benjamin Roberds | ... | NaN | 16 |
4914 | Color | Daniel Hsia | ... | 2.35 | 660 |
4915 | Color | Jon Gunn | ... | 1.85 | 456 |
5 rows × 28 columns
5.1 Naming the row and column indexes
movie.index.name = 'row_index'
movie.columns.name = 'col_index'
movie.tail()
col_index | color | director_name | ... | aspect_ratio | movie_facebook_likes |
---|---|---|---|---|---|
row_index | |||||
4911 | Color | Scott Smith | ... | NaN | 84 |
4912 | Color | NaN | ... | 16.00 | 32000 |
4913 | Color | Benjamin Roberds | ... | NaN | 16 |
4914 | Color | Daniel Hsia | ... | 2.35 | 660 |
4915 | Color | Jon Gunn | ... | 1.85 | 456 |
5 rows × 28 columns
5.2 Resetting the index
Use one or more existing columns of the DataFrame as the index
movie2 = movie.set_index('movie_title')
movie2.tail()
col_index | color | director_name | ... | aspect_ratio | movie_facebook_likes |
---|---|---|---|---|---|
movie_title | |||||
Signed Sealed Delivered | Color | Scott Smith | ... | NaN | 84 |
The Following | Color | NaN | ... | 16.00 | 32000 |
A Plague So Pleasant | Color | Benjamin Roberds | ... | NaN | 16 |
Shanghai Calling | Color | Daniel Hsia | ... | 2.35 | 660 |
My Date with Drew | Color | Jon Gunn | ... | 1.85 | 456 |
5 rows × 27 columns
Alternatively, set the index while reading the file
movie = pd.read_csv('data/movie.csv',index_col = 'movie_title')
Restore the default integer index
movie2.reset_index().tail()
col_index | movie_title | color | ... | aspect_ratio | movie_facebook_likes |
---|---|---|---|---|---|
4911 | Signed Sealed Delivered | Color | ... | NaN | 84 |
4912 | The Following | Color | ... | 16.00 | 32000 |
4913 | A Plague So Pleasant | Color | ... | NaN | 16 |
4914 | Shanghai Calling | Color | ... | 2.35 | 660 |
4915 | My Date with Drew | Color | ... | 1.85 | 456 |
5 rows × 28 columns
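The point of a meaningful index is label-based lookup. The following sketch on a made-up frame (toy and its values are assumptions for illustration) shows the round trip between set_index() and reset_index() and a lookup by title with .loc.

```python
import pandas as pd

# Hypothetical toy data, not taken from movie.csv
toy = pd.DataFrame({
    'movie_title': ['A', 'B', 'C'],
    'imdb_score': [7.9, 7.1, 6.8],
})

# The chosen column moves out of the data and becomes the row labels
by_title = toy.set_index('movie_title')

# Rows can now be looked up by label
score_b = by_title.loc['B', 'imdb_score']

# reset_index() moves the labels back into a regular column
restored = by_title.reset_index()
```

This mirrors the column counts above: 27 columns after set_index() and 28 again after reset_index().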
6 Renaming row and column labels
Rename with rename()
idx_rename = {'Avatar':'Ratava', 'Spectre': 'Ertceps'}
col_rename = {'director_name':'Director Name','num_critic_for_reviews': 'Critical Reviews'}
movie.rename(index=idx_rename, columns=col_rename).head()
| | color | Director Name | ... | aspect_ratio | movie_facebook_likes |
---|---|---|---|---|---|
movie_title | |||||
Ratava | Color | James Cameron | ... | 1.78 | 33000 |
Pirates of the Caribbean: At World's End | Color | Gore Verbinski | ... | 2.35 | 0 |
Ertceps | Color | Sam Mendes | ... | 2.35 | 85000 |
The Dark Knight Rises | Color | Christopher Nolan | ... | 2.35 | 164000 |
Star Wars: Episode VII - The Force Awakens | NaN | Doug Walker | ... | NaN | 0 |
5 rows × 27 columns
Rename by assigning lists of labels
index = movie.index
columns = movie.columns
index_list = index.tolist()
column_list = columns.tolist()
index_list[0] = 'Ratava'
index_list[2] = 'Ertceps'
column_list[1] = 'Director Name'
column_list[2] = 'Critical Reviews'
movie.index = index_list
movie.columns = column_list
movie.head()
| | color | Director Name | ... | aspect_ratio | movie_facebook_likes |
---|---|---|---|---|---|
Ratava | Color | James Cameron | ... | 1.78 | 33000 |
Pirates of the Caribbean: At World's End | Color | Gore Verbinski | ... | 2.35 | 0 |
Ertceps | Color | Sam Mendes | ... | 2.35 | 85000 |
The Dark Knight Rises | Color | Christopher Nolan | ... | 2.35 | 164000 |
Star Wars: Episode VII - The Force Awakens | NaN | Doug Walker | ... | NaN | 0 |
5 rows × 27 columns
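The two approaches differ in one important way: rename() returns a new DataFrame, while assigning to .index or .columns modifies the frame in place. A minimal sketch on a made-up frame (toy is an assumption for illustration) also shows that rename() silently ignores keys that match nothing by default.

```python
import pandas as pd

# Hypothetical toy data indexed by title
toy = pd.DataFrame({'director_name': ['X', 'Y']}, index=['Avatar', 'Spectre'])

# rename() returns a copy with the mapped labels; toy itself is unchanged
renamed = toy.rename(index={'Avatar': 'Ratava'},
                     columns={'director_name': 'Director Name'})

# A key that matches no label is ignored by default, so typos go unnoticed
same = toy.rename(columns={'no_such_column': 'whatever'})
```

Passing errors='raise' to rename() turns such unmatched keys into a KeyError, which can help catch misspelled labels.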
7 Creating and deleting columns
Add new columns by assigning to [column name]
movie = pd.read_csv('data/movie.csv')
movie['has_seen'] = 0
movie['actor_director_facebook_likes'] = (movie['actor_1_facebook_likes'] + movie['actor_2_facebook_likes'])
movie.shape,movie['actor_director_facebook_likes'].shape
((4916, 30), (4916,))
Drop rows or columns (drop() returns a new DataFrame and leaves movie unchanged)
movie.drop(['actor_director_facebook_likes','actor_1_facebook_likes'],axis=1)
| | color | director_name | ... | movie_facebook_likes | has_seen |
---|---|---|---|---|---|
0 | Color | James Cameron | ... | 33000 | 0 |
1 | Color | Gore Verbinski | ... | 0 | 0 |
2 | Color | Sam Mendes | ... | 85000 | 0 |
3 | Color | Christopher Nolan | ... | 164000 | 0 |
4 | NaN | Doug Walker | ... | 0 | 0 |
... | ... | ... | ... | ... | ... |
4911 | Color | Scott Smith | ... | 84 | 0 |
4912 | Color | NaN | ... | 32000 | 0 |
4913 | Color | Benjamin Roberds | ... | 16 | 0 |
4914 | Color | Daniel Hsia | ... | 660 | 0 |
4915 | Color | Jon Gunn | ... | 456 | 0 |
4916 rows × 28 columns
movie.drop([0,2])
| | color | director_name | ... | has_seen | actor_director_facebook_likes |
---|---|---|---|---|---|
1 | Color | Gore Verbinski | ... | 0 | 45000.0 |
3 | Color | Christopher Nolan | ... | 0 | 50000.0 |
4 | NaN | Doug Walker | ... | 0 | 143.0 |
5 | Color | Andrew Stanton | ... | 0 | 1272.0 |
6 | Color | Sam Raimi | ... | 0 | 35000.0 |
... | ... | ... | ... | ... | ... |
4911 | Color | Scott Smith | ... | 0 | 1107.0 |
4912 | Color | NaN | ... | 0 | 1434.0 |
4913 | Color | Benjamin Roberds | ... | 0 | 0.0 |
4914 | Color | Daniel Hsia | ... | 0 | 1665.0 |
4915 | Color | Jon Gunn | ... | 0 | 109.0 |
4914 rows × 30 columns
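As a hedged alternative to axis=1, drop() also accepts the columns= and index= keywords, which make the intent explicit. The sketch below uses a made-up frame (toy and the column name total are assumptions for illustration) and creates the new column with assign(), which, like drop(), returns a new DataFrame.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value
toy = pd.DataFrame({'a': [1, 2, 3], 'b': [10.0, np.nan, 30.0]})

# assign() returns a copy with the new column; NaN propagates through the sum
with_sum = toy.assign(total=toy['a'] + toy['b'])

# Explicit keywords instead of axis=1 / axis=0
fewer_cols = with_sum.drop(columns=['b'])
fewer_rows = with_sum.drop(index=[0, 2])
```

Since none of these calls modifies toy, the result has to be assigned to a name (or back to the original variable) to be kept.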