简书大神SeanCheney的译作,我作了些格式调整和文章目录结构的变化,更适合自己阅读,以后翻阅是更加方便自己查找吧
import pandas as pd
import numpy as np
设定最大列数和最大行数
pd.set_option('max_columns',5 , 'max_rows', 5)
1 基础分析
读取查看数据
college = pd.read_csv('data/college.csv')
1.1 基础信息
college.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7535 entries, 0 to 7534
Data columns (total 27 columns):
INSTNM 7535 non-null object
CITY 7535 non-null object
STABBR 7535 non-null object
HBCU 7164 non-null float64
MENONLY 7164 non-null float64
WOMENONLY 7164 non-null float64
RELAFFIL 7535 non-null int64
SATVRMID 1185 non-null float64
SATMTMID 1196 non-null float64
DISTANCEONLY 7164 non-null float64
UGDS 6874 non-null float64
UGDS_WHITE 6874 non-null float64
UGDS_BLACK 6874 non-null float64
UGDS_HISP 6874 non-null float64
UGDS_ASIAN 6874 non-null float64
UGDS_AIAN 6874 non-null float64
UGDS_NHPI 6874 non-null float64
UGDS_2MOR 6874 non-null float64
UGDS_NRA 6874 non-null float64
UGDS_UNKN 6874 non-null float64
PPTUG_EF 6853 non-null float64
CURROPER 7535 non-null int64
PCTPELL 6849 non-null float64
PCTFLOAN 6849 non-null float64
UG25ABV 6718 non-null float64
MD_EARN_WNE_P10 6413 non-null object
GRAD_DEBT_MDN_SUPP 7503 non-null object
dtypes: float64(20), int64(2), object(5)
memory usage: 1.6+ MB
1.2 统计数值列
college.describe(include=[np.number]).T
count | mean | ... | 75% | max | |
---|---|---|---|---|---|
HBCU | 7164.0 | 0.014238 | ... | 0.000000 | 1.0 |
MENONLY | 7164.0 | 0.009213 | ... | 0.000000 | 1.0 |
... | ... | ... | ... | ... | ... |
PCTFLOAN | 6849.0 | 0.522211 | ... | 0.745000 | 1.0 |
UG25ABV | 6718.0 | 0.410021 | ... | 0.572275 | 1.0 |
22 rows × 8 columns
1.3 统计对象和类型列
college.describe(include=[np.object, pd.Categorical]).T
count | unique | top | freq | |
---|---|---|---|---|
INSTNM | 7535 | 7535 | Austin Graduate School of Theology | 1 |
CITY | 7535 | 2514 | New York | 87 |
STABBR | 7535 | 59 | CA | 773 |
MD_EARN_WNE_P10 | 6413 | 598 | PrivacySuppressed | 822 |
GRAD_DEBT_MDN_SUPP | 7503 | 2038 | PrivacySuppressed | 1510 |
2 数据类型的转换
DataFrame.astype(dtype, copy=True, errors='raise', **kwargs)
-
dtype : data type, or dict of column name -> data type
- Use a numpy.dtype or Python type to cast entire pandas object to the same type.
- Alternatively, use {col: dtype, …}, where col is a column label and dtype is a numpy.dtype or Python type to cast one or more of the DataFrame’s columns to column-specific types.
- 使用numpy的数据类型,或者直接用Python内置的类型。
- 如果只传递一个dtype的话整个dataframe都会被转成该类型,可以传递一个{列名:类型}的字典,对dataframe中的列分别转换。
-
copy : bool, default True.
-
Return a copy when copy=True (be very careful setting copy=False as changes to values then may propagate to other pandas objects).
-
如果设置了copy = True 的话,不改变原有的dataframe
2.1 选取5列
different_cols = ['RELAFFIL', 'SATMTMID', 'CURROPER', 'INSTNM', 'STABBR']
col2 = college.loc[:, different_cols]
col2.dtypes
RELAFFIL int64
SATMTMID float64
CURROPER int64
INSTNM object
STABBR object
dtype: object
用memory_usage方法查看每列的内存消耗
original_mem = col2.memory_usage(deep=True)
original_mem
Index 80
RELAFFIL 60280
...
INSTNM 660240
STABBR 444565
Length: 6, dtype: int64
2.2 检查数值列的独立个数
col2['RELAFFIL'].unique()
array([0, 1])
RELAFFIL这列只包含0或1,因此没必要用64位,使用astype方法将其变为8位(1字节)整数
2.3 检查两个对象列的独立值的个数
col2.select_dtypes(include=['object']).nunique()
INSTNM 7535
STABBR 59
dtype: int64
STABBR列可以转变为“类型”(Categorical),独立值的个数小于总数的1%
2.4 类型转化
col3 = col2.astype({'STABBR':'category','RELAFFIL':np.int8})
col3.dtypes
RELAFFIL int8
SATMTMID float64
CURROPER int64
INSTNM object
STABBR category
dtype: object
new_mem = col3.memory_usage(deep = True)
new_mem
Index 80
RELAFFIL 7535
...
INSTNM 660699
STABBR 13576
Length: 6, dtype: int64
2.5 效果对比
new_mem / original_mem
Index 1.000000
RELAFFIL 0.125000
...
INSTNM 1.000695
STABBR 0.030538
Length: 6, dtype: float64
通过和原始数据比较,RELAFFIL列变为了原来的八分之一,STABBR列只有原始大小的3%
3 排序与去重
3.1 排序
DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')
- by : str or list of str
- Name or list of names to sort by.
- ascending : bool or list of bool, default True
- Sort ascending vs. descending. Specify list for multiple sort orders. If this is a list of bools, must match the length of the by.
- inplace : bool, default False
- if True, perform operation in-place
- na_position : {‘first’, ‘last’}, default ‘last’
- first puts NaNs at the beginning, last puts NaNs at the end
3.1.1 单列排序
选取出三列。按照title_year降序排列
movie = pd.read_csv('data/movie.csv')
movie2 = movie[['movie_title', 'title_year', 'imdb_score']]
movie2.sort_values('title_year', ascending=False).head()
movie_title | title_year | imdb_score | |
---|---|---|---|
3884 | The Veil | 2016.0 | 4.7 |
2375 | My Big Fat Greek Wedding 2 | 2016.0 | 6.1 |
2794 | Miracles from Heaven | 2016.0 | 6.8 |
92 | Independence Day: Resurgence | 2016.0 | 5.5 |
153 | Kung Fu Panda 3 | 2016.0 | 7.2 |
3.1.2 多列排序
通过给ascending设置列表,可以同时对一列降序排列,一列升序排列
movie4 = movie[['movie_title', 'title_year', 'content_rating', 'budget']]
movie4_sorted = movie4.sort_values(['title_year', 'content_rating', 'budget'], ascending=[False, False, True])
movie4_sorted.head()
movie_title | title_year | content_rating | budget | |
---|---|---|---|---|
4026 | Compadres | 2016.0 | R | 3000000.0 |
3884 | The Veil | 2016.0 | R | 4000000.0 |
3682 | Fifty Shades of Black | 2016.0 | R | 5000000.0 |
3685 | The Perfect Match | 2016.0 | R | 5000000.0 |
3396 | The Neon Demon | 2016.0 | R | 7000000.0 |
3.2 去重
DataFrame.drop_duplicates(subset=None, keep='first', inplace=False)
- subset : column label or sequence of labels, optional
- Only consider certain columns for identifying duplicates, by default use all of the columns
- 默认整行的所有列数据都重复了,才会被认为重复。
- subset 传递为列名的列表时,在判断重复时候,只要该几列值重复了,就认为该行是重复的。
- keep : {‘first’, ‘last’, False}, default ‘first’
- first : Drop duplicates except for the first occurrence.
- last : Drop duplicates except for the last occurrence.
- False : Drop all duplicates.
- inplace : boolean, default False
- Whether to drop duplicates in place or to return a copy
movie3 = movie2.sort_values(['title_year','imdb_score'], ascending=False)
movie3
movie_title | title_year | imdb_score | |
---|---|---|---|
4312 | Kickboxer: Vengeance | 2016.0 | 9.1 |
4277 | A Beginner's Guide to Snuff | 2016.0 | 8.7 |
... | ... | ... | ... |
3246 | The Bold and the Beautiful | NaN | 3.5 |
2119 | The Bachelor | NaN | 2.9 |
4916 rows × 3 columns
用drop_duplicates去重,只保留每年的第一条数据
movie_top_year = movie3.drop_duplicates(subset='title_year',keep='first')
movie_top_year
movie_title | title_year | imdb_score | |
---|---|---|---|
4312 | Kickboxer: Vengeance | 2016.0 | 9.1 |
3745 | Running Forever | 2015.0 | 8.6 |
... | ... | ... | ... |
4695 | Intolerance: Love's Struggle Throughout the Ages | 1916.0 | 8.0 |
2725 | Towering Inferno | NaN | 9.5 |
92 rows × 3 columns
4 从最大中选择最小
读取movie.csv,选取'movie_title', 'imdb_score', 'budget'三列
movie2 = movie[['movie_title', 'imdb_score', 'budget']]
movie2.head()
movie_title | imdb_score | budget | |
---|---|---|---|
0 | Avatar | 7.9 | 237000000.0 |
1 | Pirates of the Caribbean: At World's End | 7.1 | 300000000.0 |
2 | Spectre | 6.8 | 245000000.0 |
3 | The Dark Knight Rises | 8.5 | 250000000.0 |
4 | Star Wars: Episode VII - The Force Awakens | 7.1 | NaN |
4.1 nlargest与nsmallest
用nlargest方法,选出imdb_score分数最高的100个
movie2.nlargest(100, 'imdb_score').head()
movie_title | imdb_score | budget | |
---|---|---|---|
2725 | Towering Inferno | 9.5 | NaN |
1920 | The Shawshank Redemption | 9.3 | 25000000.0 |
3402 | The Godfather | 9.2 | 6000000.0 |
2779 | Dekalog | 9.1 | NaN |
4312 | Kickboxer: Vengeance | 9.1 | 17000000.0 |
用链式操作,nsmallest方法再从中挑出预算最小的五部
movie2.nlargest(100, 'imdb_score').nsmallest(5, 'budget')
movie_title | imdb_score | budget | |
---|---|---|---|
4804 | Butterfly Girl | 8.7 | 180000.0 |
4801 | Children of Heaven | 8.5 | 180000.0 |
4706 | 12 Angry Men | 8.9 | 350000.0 |
4550 | A Separation | 8.4 | 500000.0 |
4636 | The Other Dream Team | 8.4 | 500000.0 |
4.2 用sort_values方法
用sort_values方法,选取imdb_score最高的100个
movie2.sort_values('imdb_score', ascending=False).head(100).head()
movie_title | imdb_score | budget | |
---|---|---|---|
2725 | Towering Inferno | 9.5 | NaN |
1920 | The Shawshank Redemption | 9.3 | 25000000.0 |
3402 | The Godfather | 9.2 | 6000000.0 |
2779 | Dekalog | 9.1 | NaN |
4312 | Kickboxer: Vengeance | 9.1 | 17000000.0 |
然后可以再.sort_values('budget').head(),选出预算最低的5个,结果如下
movie2.nlargest(100, 'imdb_score').tail()
movie_title | imdb_score | budget | |
---|---|---|---|
4023 | Oldboy | 8.4 | 3000000.0 |
4163 | To Kill a Mockingbird | 8.4 | 2000000.0 |
4395 | Reservoir Dogs | 8.4 | 1200000.0 |
4550 | A Separation | 8.4 | 500000.0 |
4636 | The Other Dream Team | 8.4 | 500000.0 |