
Pandas: A First Look at Series / DataFrame

import numpy as np
import pandas as pd

pandas will be a major tool of interest throughout much of the rest of the book. It contains data structures and manipulation tools designed to make data cleaning and analysis fast and easy in Python. pandas is often used in tandem with numerical computing tools like NumPy and SciPy, analytical libraries like statsmodels and scikit-learn, and data visualization libraries like matplotlib. pandas adopts significant parts of NumPy's idiomatic style of array-based computing, especially array-based functions and a preference for data processing without for loops (array-oriented programming).

While pandas adopts many coding idioms from NumPy, the biggest difference is that pandas is designed for working with tabular or heterogeneous data. NumPy, by contrast, is best suited for working with homogeneous numerical array data. -> pandas is one of the best tools for tabular data (the author really talks it up).

Since becoming an open source project in 2010, pandas has matured into a quite large library that is applicable in a broad set of real-world use cases. -> It is widely used. The developer community has grown to over 800 distinct contributors, who have been helping build the project as they have used it to solve their day-to-day data problems. -> It solves a lot of everyday data-processing problems.

Throughout the rest of the book, I use the following import convention for pandas:

import pandas as pd
# from pandas import Series, DataFrame

Thus, whenever you see pd in code, it is referring to pandas. You may also find it easier to import Series and DataFrame into the local namespace since they are frequently used:

from pandas import Series, DataFrame

To get started with pandas, you will need to get comfortable with its two workhorse data structures: Series and DataFrame. While they are not a universal solution for every problem, they provide a solid, easy-to-use basis for most applications.

Series

A Series is a one-dimensional array-like object containing a sequence of values (of similar types to NumPy types) and an associated array of data labels, called its index. The simplest Series is formed from only an array of data. -> A Series is like a one-dimensional NumPy array with an index.

obj = pd.Series([4, 7, -5, 3])
obj

0    4
1    7
2   -5
3    3
dtype: int64

The string representation of a Series displayed interactively shows the index on the left and the values on the right. Since we did not specify an index for the data, a default one consisting of the integers 0 through N-1 (where N is the length of the data) is created. You can get the array representation and index object of the Series via its values and index attributes, respectively: -> Access them through the values and index attributes.

obj.values

array([ 4,  7, -5,  3], dtype=int64)

obj.index  # like range(4)

RangeIndex(start=0, stop=4, step=1)

Often it will be desirable to create a Series with an index identifying each data point with a label:

obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj2

"Print the index"
obj2.index

d    4
b    7
a   -5
c    3
dtype: int64

'Print the index'

Index(['d', 'b', 'a', 'c'], dtype='object')

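Compared with NumPy arrays, you can use labels in the index when selecting single values or a set of values. -> Select one or several elements by index label.

"Select a single element with [label]"
obj2['a']

"Modify an element by direct assignment; the change is made in place"
obj2['d'] = 'cj'

"Select several elements with [[labels]]; on this pandas version a missing label gives NaN, newer versions raise KeyError"
obj2[['c', 'a', 'd', 'xx']]

'Select a single element with [label]'

-5

'Modify an element by direct assignment; the change is made in place'

'Select several elements with [[labels]]; on this pandas version a missing label gives NaN, newer versions raise KeyError'

c:\python\python36\lib\site-packages\pandas\core\series.py:851: FutureWarning: 
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
  return self.loc[key]

c       3
a      -5
d      cj
xx    NaN
dtype: object

"Assigning to an element modifies the Series in place by default"
obj2

'Assigning to an element modifies the Series in place by default'

d    cj
b     7
a    -5
c     3
dtype: object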

Here ['c', 'a', 'd'] is interpreted as a list of indices, even though it contains strings instead of integers. -> To select with several index keys, collect them in a list and pass that list as a single argument to the indexer.
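The FutureWarning shown earlier has since been enforced: on recent pandas releases, list selection with a missing label raises KeyError instead of returning NaN. A minimal sketch of the `.reindex()` alternative that the warning recommends (the names `s` and `picked` are only for illustration):

```python
import pandas as pd

s = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

# reindex() accepts labels that are absent from the index and
# fills them with NaN instead of raising.
picked = s.reindex(['c', 'a', 'd', 'xx'])
print(picked)

# Plain [] selection with a missing label: recent pandas versions raise
# KeyError; very old versions returned NaN with a FutureWarning.
try:
    s[['c', 'a', 'xx']]
except KeyError as exc:
    print('KeyError:', exc)
```

Because NaN is a float, the reindexed result is upcast from int64 to float64.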

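Using NumPy functions or NumPy-like operations, such as filtering with a boolean array, scalar multiplication, or applying math functions, will preserve the index-value link: -> Operate on it just like a NumPy array: boolean arrays, scalar multiplication, math functions, and so on.

"Filter the elements of the Series that are greater than 0, keeping their index labels"
"Restore the data first: a string cannot be compared with numbers"
obj2['d'] = 4 

obj2[obj2 > 0]

"Scalar arithmetic"
obj2 * 2

"Call a NumPy function"
"obj2 is now dtype object, so np.exp(obj2) fails; use the int64 obj instead (.values is optional for a numeric Series)"
np.exp(obj.values)

'Filter the elements of the Series that are greater than 0, keeping their index labels'

'Restore the data first: a string cannot be compared with numbers'

d    4
b    7
c    3
dtype: object

'Scalar arithmetic'

d      8
b     14
a    -10
c      6
dtype: object

'Call a NumPy function'

'obj2 is now dtype object, so np.exp(obj2) fails; use the int64 obj instead (.values is optional for a numeric Series)'

array([5.45981500e+01, 1.09663316e+03, 6.73794700e-03, 2.00855369e+01])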

"cj test"
obj2 > 0

np.exp(obj2)

'cj test'

d     True
b     True
a    False
c     True
dtype: bool

---------------------------------------------------------------------------

AttributeError                            Traceback (most recent call last)

<ipython-input-39-86002a981278> in <module>
      2 obj2 > 0
      3 
----> 4 np.exp(obj2)

AttributeError: 'int' object has no attribute 'exp'
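The error above is not caused by the index. It appears because obj2 became an object-dtype Series after the 'cj' assignment, and NumPy then tries to call an `exp` method on each Python int. A minimal sketch, assuming a Series built directly with `dtype=object` to mimic that state (the name `obj_series` is hypothetical):

```python
import numpy as np
import pandas as pd

# An object-dtype Series holding Python ints, like obj2 after the
# 'cj' assignment and the reset to 4.
obj_series = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'], dtype=object)

try:
    np.exp(obj_series)
except (AttributeError, TypeError) as exc:
    # NumPy looks for an .exp() method on each Python int and fails;
    # the exact exception type depends on the NumPy version.
    print(type(exc).__name__, exc)

# Casting back to a numeric dtype makes the ufunc work, and the
# result is still a Series that keeps the original index.
result = np.exp(obj_series.astype('int64'))
print(result)
```

So a NumPy ufunc applied to a numeric Series preserves the index-value link; stripping the index with `.values` is not required.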

Another way to think about a Series is as a fixed-length, ordered dict, as it is a mapping of index values to data values. -> (A Series can be seen as an ordered dict mapping: the index is the key and the data is the value.) It can be used in many contexts where you might use a dict:

"As with a dict, membership tests and selection operate on the keys (the index)"

'b' in obj2
'xxx' in obj2

'As with a dict, membership tests and selection operate on the keys (the index)'

True

False
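The dict analogy holds for membership tests and lookups, but not for iteration. A small sketch of where a Series behaves like a dict and where it does not (the Series `s` is illustrative):

```python
import pandas as pd

s = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

print('b' in s)          # membership tests the index labels
print(7 in s)            # ... not the values
print(7 in s.values)     # test the values explicitly

# Unlike a dict, iterating a Series yields values, not keys.
values_seen = list(s)
keys_seen = list(s.keys())      # the index labels
as_dict = dict(s.items())       # (label, value) pairs, like dict.items()
print(values_seen, keys_seen, as_dict)
```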

Should you have data contained in a Python dict, you can create a Series from it by passing the dict: -> A Python dict converts directly to a Series, with its keys as the index.

sdata = {'Ohio':35000, 'Texas':71000, 'Oregon':16000, 'Utah':5000}

"A dict converts directly to a Series"
obj3 = pd.Series(sdata)
obj3

'A dict converts directly to a Series'

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

# cj test

"Nested dicts also work, but only the top level is expanded"

cj_data = {'Ohio':{'sex':1, 'age':18}, 'Texas':{'cj':123}}

pd.Series(cj_data)

'Nested dicts also work, but only the top level is expanded'

Ohio     {'sex': 1, 'age': 18}
Texas              {'cj': 123}
dtype: object

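When you pass only a dict, the index of the resulting Series holds the dict's keys. On recent pandas versions they keep the dict's insertion order, as in the output above (older versions sorted them). You can override this by passing the dict keys in the order you want them to appear in the resulting Series: -> With a dict, the default index is its keys; passing index gives any ordering we want:

"Override the original index"

states = ['California', 'Ohio', 'Oregon', 'Texas']

"Matching keys keep their values; keys that are not found show up as NA"
obj4 = pd.Series(sdata, index=states)
obj4

'Override the original index'

'Matching keys keep their values; keys that are not found show up as NA'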

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

Here, three values found in sdata were placed in the appropriate locations (the keys matched), but since no value for 'California' was found, it appears as NaN (not a number), which is considered in pandas to mark missing or NA values. Since 'Utah' was not included in states, it is excluded from the resulting object.

I will use the terms 'missing' or 'NA' interchangeably to refer to missing data. The isnull and notnull functions in pandas should be used to detect missing data:

"pd.isnull() and pd.notnull() detect missing values"
pd.isnull(obj4)

"The positive form"
pd.notnull(obj4)

"Series also has these as instance methods:"
obj4.notnull()

'pd.isnull() and pd.notnull() detect missing values'

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

'The positive form'

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

'Series also has these as instance methods:'

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

I discuss working with missing data in more detail in Chapter 7.

A useful Series feature for many applications is that it automatically aligns by index label in arithmetic operations. -> In arithmetic, a Series aligns on the index automatically: entries with the same label are treated as the same entry, which is the key point.

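obj3
obj4

"obj3 + obj4: values with the same index label are added; labels found on only one side give NaN"
obj3 + obj4

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

'obj3 + obj4: values with the same index label are added; labels found on only one side give NaN'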

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64

Data alignment features will be addressed in more detail later. If you have experience with databases, you can think about this as being similar to a join operation. -> (Data alignment resembles database joins: inner join, left join, right join.)
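To make the join analogy concrete, here is a minimal sketch with two small made-up Series (`s1` and `s2` are illustrative, not from the original notes). Plain `+` behaves like an outer join on the labels, and `Series.add` with `fill_value` is one way to avoid the NaN results:

```python
import pandas as pd

s1 = pd.Series({'a': 1, 'b': 2})
s2 = pd.Series({'b': 10, 'c': 20})

# '+' aligns on the union of both indexes (like an outer join);
# labels present on only one side become NaN.
outer_sum = s1 + s2

# add() with fill_value treats a label missing on one side as 0.
filled_sum = s1.add(s2, fill_value=0)

# Keeping only labels present on both sides (like an inner join)
# can be done by dropping the NaN results.
inner_sum = (s1 + s2).dropna()

print(outer_sum, filled_sum, inner_sum, sep='\n\n')
```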

Both the Series object itself and its index have a name attribute, which integrates with other key areas of pandas functionality: -> (The name attribute ties in with other key areas of pandas.)

"Set the name of the values: obj4.name = 'xxx'"
obj4.name = 'population'

"Set the name of the index: obj4.index.name = 'xxx'"
obj4.index.name = 'state'

obj4

"Set the name of the values: obj4.name = 'xxx'"

"Set the name of the index: obj4.index.name = 'xxx'"

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64

A Series's index can be altered in place by assignment. -> The index can be changed in place by assigning to it.

obj

"Assigning obj.index = [...] changes the index in place; a length mismatch raises an error"

obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
obj

0    4
1    7
2   -5
3    3
dtype: int64

'Assigning obj.index = [...] changes the index in place; a length mismatch raises an error'

Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64
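Assigning an index whose length differs from the data raises an error. A quick sketch of that failure mode (the variable names are illustrative):

```python
import pandas as pd

s = pd.Series([4, 7, -5, 3])
s.index = ['Bob', 'Steve', 'Jeff', 'Ryan']   # same length: fine

raised = False
try:
    s.index = ['Bob', 'Steve']               # 2 labels for 4 values
except ValueError as exc:
    raised = True
    print('ValueError:', exc)

# The failed assignment leaves the Series unchanged.
print(s.index.tolist())
```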

DataFrame

A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). -> (Each column can hold a different data type.) The DataFrame has both a row and column index (the row index is index, the column index is columns). It can be thought of as a dict of Series all sharing the same index. Under the hood, the data is stored as one or more two-dimensional blocks rather than a list, dict, or some other collection of one-dimensional arrays. The exact details of DataFrame's internals are outside the scope of this book.

While a DataFrame is physically two-dimensional, you can use it to represent higher dimensional data in a tabular format using hierarchical indexing, a subject we will discuss in Chapter 8 and an ingredient in some of the more advanced data-handling features in pandas. -> Hierarchical indexing handles multidimensional data; the more advanced features for higher-dimensional data come later.
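As a rough preview of hierarchical indexing, the sketch below stores (state, year) pairs as a two-level row index, so a flat table carries what would otherwise need an extra dimension. The data is a small made-up subset and the names `idx`, `df`, and `wide` are illustrative:

```python
import pandas as pd

# Two index levels (state, year) encode what would otherwise be
# an extra dimension in a flat two-dimensional table.
idx = pd.MultiIndex.from_product(
    [['Nevada', 'Ohio'], [2001, 2002]], names=['state', 'year'])
df = pd.DataFrame({'pop': [2.4, 2.9, 1.7, 3.6]}, index=idx)

print(df.loc['Ohio'])                     # all rows for one state
cell = df.loc[('Nevada', 2002), 'pop']    # a single cell
print(cell)

# unstack() pivots the inner level (year) into columns.
wide = df['pop'].unstack('year')
print(wide)
```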

There are many ways to construct a DataFrame, though one of the most common is from a dict of equal-length lists or NumPy arrays. -> (The most common way to build a DataFrame is to pass a dict of equal-length lists or arrays.)

data = {
    'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
    'year': [2000, 2001, 2002, 2001, 2002, 2003],
    'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]
}

frame = pd.DataFrame(data)

The resulting DataFrame will have its index assigned automatically, as with Series. The columns keep the order of the dict keys here (older pandas versions placed them in sorted order):

frame

	state	year	pop
0	Ohio	2000	1.5
1	Ohio	2001	1.7
2	Ohio	2002	3.6
3	Nevada	2001	2.4
4	Nevada	2002	2.9
5	Nevada	2003	3.2

If you are using the Jupyter notebook, pandas DataFrame objects will be displayed as a more browser-friendly HTML table.

For large DataFrames, the head method selects only the first five rows: -> df.head() shows the first 5 rows by default.

frame.head()

	state	year	pop
0	Ohio	2000	1.5
1	Ohio	2001	1.7
2	Ohio	2002	3.6
3	Nevada	2001	2.4
4	Nevada	2002	2.9

If you specify a sequence of columns, the DataFrame's columns will be arranged in that order: -> Specify the column order.

"Arrange the columns in the given order"
pd.DataFrame(data, columns=['year', 'state', 'pop'])

'Arrange the columns in the given order'

	year	state	pop
0	2000	Ohio	1.5
1	2001	Ohio	1.7
2	2002	Ohio	3.6
3	2001	Nevada	2.4
4	2002	Nevada	2.9
5	2003	Nevada	3.2

If you pass a column that isn't contained in the dict, it will appear with missing values in the result:

frame2 = pd.DataFrame(data, 
                     columns=['year', 'state', 'pop', 'debt'],
                     index=['one', 'two', 'three', 'four', 'five', 'six'])

"Columns that do not exist in the data are created and filled with NaN"
frame2

"An index of the wrong length raises an error; frame.columns shows the column index"
frame2.columns

'Columns that do not exist in the data are created and filled with NaN'

	year	state	pop	debt
one	2000	Ohio	1.5	NaN
two	2001	Ohio	1.7	NaN
three	2002	Ohio	3.6	NaN
four	2001	Nevada	2.4	NaN
five	2002	Nevada	2.9	NaN
six	2003	Nevada	3.2	NaN

'An index of the wrong length raises an error; frame.columns shows the column index'

Index(['year', 'state', 'pop', 'debt'], dtype='object')

A column in a DataFrame can be retrieved as a Series either by dict-like notation or by attribute:
-> (bracket indexing with the column name, or df.column_name)

"Bracket indexing: [column name]"
frame2['state']

"Attribute access: df.column_name"
frame2.state

'Bracket indexing: [column name]'

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object

'Attribute access: df.column_name'

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object

Attribute-like access (e.g., frame2.year) and tab completion of column names in IPython are provided as a convenience. -> Selecting a column as an attribute is quite handy.
frame2[column] works for any column name, but frame2.column only works when the column name is a valid Python variable name.
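Besides invalid identifiers, attribute access also breaks when a column name collides with an existing DataFrame attribute or method. The 'pop' column used in these notes is such a case, because `DataFrame.pop` is a method. A small sketch (the 'my col' column is a hypothetical example added for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'year': [2000, 2001],
    'pop': [1.5, 1.7],
    'my col': ['x', 'y'],   # hypothetical column with a space in its name
})

print(df.year)          # valid identifier, no clash: returns the column
print(df['my col'])     # not a valid identifier: only [] works

# 'pop' is also the name of a DataFrame method, so attribute access
# returns the method rather than the column.
print(callable(df.pop))
print(df['pop'])
```

For that reason, bracket notation is the safer default in scripts; attribute access is mainly an interactive convenience.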

Note that the returned Series have the same index as the DataFrame, and their name attribute has been appropriately set.

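Rows can also be retrieved by position or name with the special loc attribute (much more on this later). -> The loc attribute selects rows.

"Select the row whose index label is three: loc[label]"
frame2.loc['three']

"Select the second and third rows with frame.loc[1:2] (labels 1 through 2, inclusive)"
frame.loc[1:2]

'Select the row whose index label is three: loc[label]'

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object

'Select the second and third rows with frame.loc[1:2] (labels 1 through 2, inclusive)'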

	state	year	pop
1	Ohio	2001	1.7
2	Ohio	2002	3.6
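Note that loc selects by label, and a label slice includes both endpoints. Selection by integer position goes through iloc instead, where the end of a slice is excluded. A minimal sketch contrasting the two on a small illustrative frame:

```python
import pandas as pd

frame = pd.DataFrame({'state': ['Ohio', 'Ohio', 'Nevada'],
                      'year': [2000, 2001, 2001]},
                     index=['one', 'two', 'three'])

by_label = frame.loc['one':'two']   # label slice: both ends included
by_position = frame.iloc[0:2]       # position slice: end excluded

print(by_label.equals(by_position))

row = frame.iloc[2]                 # third row, returned as a Series
print(row['state'])
```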

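Columns can be modified by assignment. For example, the empty 'debt' column could be assigned a scalar value or an array of values: -> Values are modified in place.

frame2['debet'] = 16.5

"Assignment sets the whole column in place; the misspelled name 'debet' creates a new column next to 'debt'"
frame2

"Assignment sets the whole column in place; the misspelled name 'debet' creates a new column next to 'debt'"

	year	state	pop	debt	debet
one	2000	Ohio	1.5	NaN	16.5
two	2001	Ohio	1.7	NaN	16.5
three	2002	Ohio	3.6	NaN	16.5
four	2001	Nevada	2.4	NaN	16.5
five	2002	Nevada	2.9	NaN	16.5
six	2003	Nevada	3.2	NaN	16.5

"Modified in place; the array is matched to the rows by position"
frame2['debet'] = np.arange(6)

"Drop the debt column: axis=1 means columns, inplace=True drops it in place"
frame2.drop(labels='debt', axis=1, inplace=True)

frame2

'Modified in place; the array is matched to the rows by position'

'Drop the debt column: axis=1 means columns, inplace=True drops it in place'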

	year	state	pop	debet
one	2000	Ohio	1.5	0
two	2001	Ohio	1.7	1
three	2002	Ohio	3.6	2
four	2001	Nevada	2.4	3
five	2002	Nevada	2.9	4
six	2003	Nevada	3.2	5

frame2.columns

Index(['year', 'state', 'pop', 'debet'], dtype='object')

frame2['debet']

one      0
two      1
three    2
four     3
five     4
six      5
Name: debet, dtype: int32

When you are assigning lists or arrays to a column, the value's length must match the length of the DataFrame (the inserted data must line up in length). If you assign a Series, its labels will be realigned exactly to the DataFrame's index, inserting missing values in any holes:

val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])

"Aligned automatically by index label"
frame2['debet'] = val

frame2

'Aligned automatically by index label'

	year	state	pop	debet
one	2000	Ohio	1.5	NaN
two	2001	Ohio	1.7	-1.2
three	2002	Ohio	3.6	NaN
four	2001	Nevada	2.4	-1.5
five	2002	Nevada	2.9	-1.7
six	2003	Nevada	3.2	NaN
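The two assignment rules can be put side by side: a list or array is matched by position and must have the right length, while a Series is matched by label. A minimal sketch on a small illustrative frame (the label 'nine' is deliberately absent from the frame's index):

```python
import pandas as pd

df = pd.DataFrame({'year': [2000, 2001, 2002]}, index=['one', 'two', 'three'])

# A list is matched by position, so its length must equal len(df).
try:
    df['debt'] = [1.0, 2.0]
except ValueError as exc:
    print('ValueError:', exc)

# A Series is matched by label: missing labels become NaN and
# labels not in df.index ('nine' here) are silently dropped.
df['debt'] = pd.Series([-1.2, 9.9], index=['two', 'nine'])
print(df)
```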

Assigning a column that doesn't exist will create a new column. The del keyword will delete columns as with a dict. -> Use del to remove a column.

As an example of del, I first add a new column of boolean values where the state column equals 'Ohio':

frame2['eastern'] = frame2.state == 'Ohio'

"First add a new column, eastern"
frame2

"Then delete that column with the del keyword"
del frame2['eastern']

"Show the column names: the eastern column is gone; the drop() method works too"
frame2.columns

'First add a new column, eastern'

	year	state	pop	debet	eastern
one	2000	Ohio	1.5	NaN	True
two	2001	Ohio	1.7	-1.2	True
three	2002	Ohio	3.6	NaN	True
four	2001	Nevada	2.4	-1.5	False
five	2002	Nevada	2.9	-1.7	False
six	2003	Nevada	3.2	NaN	False

'Then delete that column with the del keyword'

'Show the column names: the eastern column is gone; the drop() method works too'

Index(['year', 'state', 'pop', 'debet'], dtype='object')

The column returned from indexing a DataFrame is a view on the underlying data, not a copy. Thus, any in-place modifications to the Series will be reflected in the DataFrame. The column can be explicitly copied with the Series's copy method. -> Copy the column explicitly; otherwise you are working on a view.
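Whether a modification of the selected column reaches the DataFrame depends on the pandas version: releases with Copy-on-Write enabled no longer propagate such changes. An explicit copy is independent in every version, which is what this small sketch checks (the names `frame` and `col_copy` are illustrative):

```python
import pandas as pd

frame = pd.DataFrame({'pop': [1.5, 1.7, 3.6]})

col_copy = frame['pop'].copy()   # explicit, independent copy
col_copy.iloc[0] = 99.0          # modify only the copy

print(frame['pop'].tolist())     # the DataFrame is untouched
print(col_copy.tolist())
```

To change the DataFrame itself, assigning through the frame (for example with `frame.loc[row, 'pop'] = value`) avoids relying on view semantics.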

Another common form of data is a nested dict of dicts:

pop = {
    'Nevada': {2001:2.4, 2002:2.9},
    'Ohio': {2000:1.5, 2001:1.7, 2002:3.6}
}

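If the nested dict is passed to the DataFrame, pandas will interpret the outer dict keys as the columns and the inner keys as the row indices: -> (With one level of nesting, pandas uses the outer keys as columns and the inner keys as the index.)

frame3 = pd.DataFrame(pop)
"Outer dict keys become the columns; the keys of the values become the index"
frame3

'Outer dict keys become the columns; the keys of the values become the index'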

	Nevada	Ohio
2000	NaN	1.5
2001	2.4	1.7
2002	2.9	3.6

You can transpose the DataFrame (swap rows and columns) with similar syntax to a NumPy array:

"Transpose"
frame3.T

'Transpose'

	2000	2001	2002
Nevada	NaN	2.4	2.9
Ohio	1.5	1.7	3.6

The keys in the inner dicts are combined and sorted to form the index in the result. This isn't true if an explicit index is specified:

# pd.DataFrame(pop, index=('a', 'b','c'))
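The commented-out line above uses labels that do not occur in the inner dicts, so every cell would come out as NaN. A sketch with year labels instead shows the effect of an explicit index more clearly (the index values and the name `pop_data` are chosen here for illustration):

```python
import pandas as pd

pop_data = {
    'Nevada': {2001: 2.4, 2002: 2.9},
    'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6},
}

# Only the requested labels are kept, in the requested order;
# 2000 is dropped and 2003 has no data, so it is filled with NaN.
explicit = pd.DataFrame(pop_data, index=[2001, 2002, 2003])
print(explicit)
```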

Dicts of Series are treated in much the same way.

pdata = {
    'Ohio': frame3['Ohio'][:-1],
    'Nevada': frame3['Nevada'][:2]
}

pd.DataFrame(pdata)

	Ohio	Nevada
2000	1.5	NaN
2001	1.7	2.4

For a complete list of things you can pass to the DataFrame constructor, see Table 5-1.
If a DataFrame's index and columns have their name attributes set, these will also be displayed: -> Set the name attribute of the row and column indexes.

frame3.index.name = 'year'
frame3.columns.name = 'state'

frame3

state	Nevada	Ohio
year
2000	NaN	1.5
2001	2.4	1.7
2002	2.9	3.6

As with Series, the values attribute returns the data contained in the DataFrame as a two-dimensional ndarray: -> For a DataFrame, values is two-dimensional.

frame3.values

array([[nan, 1.5],
       [2.4, 1.7],
       [2.9, 3.6]])

If the DataFrame's columns are different dtypes, the dtype of the values array will be chosen to accommodate all of the columns.

"A dtype that can hold every column's data is chosen automatically"
frame2.values

'A dtype that can hold every column's data is chosen automatically'

array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, nan],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, nan],
       [2002, 'Nevada', 2.9, nan],
       [2003, 'Nevada', 3.2, nan]], dtype=object)
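The chosen dtype depends on which columns are included. A short sketch on a small illustrative frame: string columns force dtype object, while purely numeric columns are upcast to a common numeric dtype. `to_numpy()` is the newer method for the same conversion:

```python
import pandas as pd

df = pd.DataFrame({'year': [2000, 2001],
                   'state': ['Ohio', 'Nevada'],
                   'pop': [1.5, 2.4]})

mixed = df.values                         # strings force dtype object
numeric = df[['year', 'pop']].to_numpy()  # int64 + float64 -> float64

print(mixed.dtype, numeric.dtype)
```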

Table 5-1. Possible data inputs to the DataFrame constructor

2D ndarray: A matrix of data, passing optional row and column labels
... (the rest can wait until it is needed)

Index Objects

pandas's Index objects are responsible for holding the axis labels and other metadata (like the axis name or names). Any array or other sequence of labels you use when constructing a Series or DataFrame is internally converted to an Index:

obj = pd.Series(range(3), index=['a', 'b', 'c'])

index = obj.index
index

index[1:]
obj

Index(['a', 'b', 'c'], dtype='object')

Index(['b', 'c'], dtype='object')

a    0
b    1
c    2
dtype: int64

Index objects are immutable and thus can't be modified by the user:

index[1] = 'd'

---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-14-a452e55ce13b> in <module>
----> 1 index[1] = 'd'

c:\python\python36\lib\site-packages\pandas\core\indexes\base.py in __setitem__(self, key, value)
   2063 
   2064     def __setitem__(self, key, value):
-> 2065         raise TypeError("Index does not support mutable operations")
   2066 
   2067     def __getitem__(self, key):

TypeError: Index does not support mutable operations

"The index is immutable"
index

'The index is immutable'

Index(['a', 'b', 'c'], dtype='object')

labels = pd.Index(np.arange(3))
labels

Int64Index([0, 1, 2], dtype='int64')

obj2 = pd.Series([1.5, -2.5, 0], index=labels)
obj2

0    1.5
1   -2.5
2    0.0
dtype: float64

obj2.index is labels

True

Unlike Python sets, a pandas Index can contain duplicate labels.

Selections with duplicate labels will select all occurrences of that label.
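A pandas Index may hold the same label more than once, and selecting such a label returns every matching entry. A minimal sketch (the labels and the names `dup_index` and `s` are made up for illustration):

```python
import pandas as pd

dup_index = pd.Index(['a', 'a', 'b'])
s = pd.Series([0, 1, 2], index=dup_index)

print(dup_index.is_unique)   # the index contains a repeated label

both = s['a']     # duplicated label: a Series with both entries
single = s['b']   # unique label: a scalar
print(both)
print(single)
```

The return type therefore depends on whether the label is duplicated, which is worth keeping in mind when writing code that expects a scalar.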

Each Index has a number of methods and properties for set logic, which answer other common questions about the data it contains. Some useful ones are summarized in Table 5-2.

Method        Description
append        Concatenate with additional Index objects, producing a new Index
difference    Compute set difference as an Index
intersection  Compute set intersection
union         Compute set union
isin          Compute a boolean array indicating whether each value is contained in the passed collection
delete        Compute new Index with element at index i deleted
drop          Compute new Index by deleting passed values
insert        Compute new Index by inserting element at index i
is_unique     Return True if the Index has no duplicate values
unique        Compute the array of unique values in the Index
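The methods in the table can be tried on two small Index objects. This is a quick sketch with made-up labels; since an Index is immutable, every call returns a new object and leaves `a` and `b` unchanged:

```python
import pandas as pd

a = pd.Index(['a', 'b', 'c'])
b = pd.Index(['b', 'c', 'd'])

union = a.union(b)                 # labels in either index
common = a.intersection(b)         # labels in both
only_a = a.difference(b)           # labels in a but not in b
mask = a.isin(['a', 'c'])          # boolean array, one entry per label

combined = a.append(b)             # plain concatenation, keeps duplicates
deduped = combined.unique()

inserted = a.insert(1, 'z')        # new Index with 'z' at position 1
deleted = a.delete(0)              # new Index without position 0
dropped = a.drop(['b'])            # new Index without the label 'b'

print(union, common, only_a, mask, combined.is_unique, deduped,
      inserted, deleted, dropped, sep='\n')
```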


Original post: https://www.cnblogs.com/chenjieyouge/p/11869423.html