zoukankan      html  css  js  c++  java
  • Python的工具包[1] -> pandas数据预处理 -> pandas 库及使用总结

    pandas数据预处理 / pandas data pre-processing


    目录

    1. 关于 pandas
    2. pandas 库
    3. pandas 基本操作
    4. pandas 计算
    5. pandas 的 Series
    6. pandas 常用函数
    7. 补充内容

    1 关于pandas / About pandas

    Pandas起源

    Python Data Analysis Library或pandas是基于NumPy的一种工具,该工具是为了解决数据分析任务而创建的。Pandas 纳入了大量库和一些标准的数据模型,提供了高效地操作大型数据集所需的工具。pandas提供了大量能使我们快速便捷地处理数据的函数和方法。

    Pandas是python的一个数据分析包,最初由AQR Capital Management于2008年4月开发,并于2009年底开源出来,目前由专注于Python数据包开发的PyData开发team继续开发和维护,属于PyData项目的一部分。Pandas最初被作为金融数据分析工具而开发出来,因此,pandas为时间序列分析提供了很好的支持。 Pandas的名称来自于面板数据(panel data)和python数据分析(data analysis)。panel data是经济学中关于多维数据集的一个术语,在Pandas中也提供了panel的数据类型。

    Pandas中的数据结构

    Series:

    一维数组,与Numpy中的一维Array类似。二者与Python基本的数据结构List也很相近,其区别是,List中的元素可以是不同的数据类型,而Array和Series中则只允许存储相同的数据类型,这样可以更有效的使用内存,提高运算效率。

    Time- Series:

    以时间为索引的Series。

    DataFrame:

    二维的表格型数据结构。很多功能与R中的data.frame类似。可以将DataFrame理解为Series的容器。以下的内容主要以DataFrame为主

    Panel:

    三维的数组,可以理解为DataFrame的容器

    Pandas中一般的数据结构构成为DataFrame -> Series -> ndarray

    2 pandas / pandas Library

    环境安装:

    pip install pandas

    2.1 常量 / Constants

    pass

    2.2 函数 / Function

    2.2.1 read_csv()函数

    函数调用: info = pd.read_csv(filename)

    函数功能:读取指定的csv文件,生成一个包含csv数据的DataFrame

    传入参数: filename

    filename: str类型,需要读取的文件名

    返回参数: info

    info: DataFrame类型,读取文件生成的DataFrame

    类似方法还有: read_excel / read_json / read_sql / read_html

    2.2.2 isnull()函数

    函数调用: bool = pd.isnull(obj)

    函数功能:返回一个包含数据是否是null的信息数据

    传入参数: obj

    obj: DataFrame/Series类型,待判断的数据

    返回参数: bool

    bool: DataFrame/Series类型,返回的判断结果,True表示null,False则不是

    2.2.3 to_datetime()函数

    函数调用: date = pd.to_datetime(arg)

    函数功能:将传入的数据转换成日期数据格式返回

    传入参数: arg

    arg: int/float/srting/datetime/list/tuple/1-d array/Series类型,argument,可传入一维数组或Series,0.18.1版本中加入DataFrame和dict-like结构

    返回参数: date

    date: 返回的数据类型由传入的参数确定

    Note: pandas中通过to_datetime函数转换的而成的数据其dtype为datetime64[ns],该数据存在的Series可以通过.dt.month/year/day获取所需要的日期信息

    2.3 / Class

    2.3.1 DataFrame

    类实例化:df = pd.DataFrame(data, index=) / pd.read_xxx(file_name)

    类的功能:用于生成DataFrame

    传入参数: data, index / file_name

    data: ndarray类型,包含需要构建成DataFrame的数据(二维)

    index: Series类型,决定作为索引的列参数

    file_name: str类型,需要读取的文件名

    返回参数: df

    df: DataFrame类型,生成的DataFrame

    2.3.1.1 dtypes属性

    属性调用: fmt = df.dtypes

    属性功能: 返回数据结构中每列的数据类型(由于是多个,使用dtypes,numpy中单个,使用dtype)

    属性参数: fmt

    fmt: Series类型,包含每个数据值的数据类型,index为列名,value为类型,其中,object类型相当于Python中的string

    2.3.1.2 columns属性

    属性调用: index_name = df.columns

    属性功能: 返回数据结构中每列的列名

    属性参数: index_name

    Index_name: Index类型,<class 'pandas.core.indexes.base.Index'>,包含每列的列名

    2.3.1.3 shape属性方法

    属性调用: shp = df.shape

    属性功能: 返回数据结构的行列参数

    属性参数: shp

    shp: tuple类型,(row, column),返回行列数

    2.3.1.4 loc属性

    属性调用: index = df.loc

    属性功能: 返回一个index的类

    属性参数: index

    index: obj类型,<class 'pandas.core.indexing._LocIndexer'>,可用于切片获取数据信息的DataFrame,如index[0]获取第一行,index[3:7]获取3-7行的数据

    2.3.1.5 head()方法

    函数调用: hdf = df.head(num=5)

    函数功能: 返回csv列表中的前num行数据

    传入参数: num

    num: int类型,需要获取的行数

    返回参数: hdf

    hdf: DataFrame类型,原数据的前num行数据

    2.3.1.6 tail()方法

    函数调用: tdf = df.tail(num=5)

    函数功能: 返回csv列表中的后num行数据

    传入参数: num

    num: int类型,需要获取的行数

    返回参数: tdf

    tdf: DataFrame类型,原数据的后num行数据

    2.3.1.7 describe()方法

    函数调用: ddf = df.describe()

    函数功能: 返回csv列表中每个列的一些统计描述参数

    返回参数:

    返回参数: ddf

    ddf: DataFrame类型,包括的信息有,每一列的数量count,均值mean,标准差std,最小值min,1/4位数25%,中位数50%,3/4位数75%,最大值max

    2.3.1.8 sort_values()方法

    函数调用: sdf = df.sort_values(by, axis=0, ascending=True, inplace=False, kind=’quicksort’, na_position=’last’)

    函数功能: 返回按参数排序的DataFrame

    传入参数: by, axis, ascending, inplace, kind, na_position

    by: str类型,DataFrame的行/列名

    axis: int类型,0按列(第一轴)sort,1按行(最后轴)sort

    ascending: bool类型,True为升序排列, False为降序排列

    inplace: bool类型,True则修改原DataFrame,False则返回新的DataFrame

    kind: str类型,确定sort的排序算法,包括{‘quicksort’, ‘mergesort’, ‘heapsort’}

    na_position: str类型,确定na数据存在的位置,‘first’/‘last’

    返回参数: sdf

    sdf: DataFrame类型,重排后的DataFrame

    2.3.1.9 mean ()方法

    函数调用: mdf = df.mean(axis=0)

    函数功能: 返回存储所有非NaN的值的平均值DataFrame

    传入参数: axis

    axis: int类型,0按列(第一轴)sort,1按行(最后轴)sort

    返回参数: mdf

    mdf: DataFrame类型,存储均值的数据类型为float

    2.3.1.10 pivot_table ()方法

    函数调用: cdf = df.pivot_table(index=, values=, aggfunc=)

    函数功能: 根据index将数据分组,对于values列的值(相同类型)执行aggfunc函数

    传入参数: index, values, aggfunc

    index: str类型,进行分组的列的列名

    values: str/list类型,需要计算的列的列名,多个则使用list

    aggfunc: method类型,需要调用的方法

    返回参数: cdf

    cdf: DataFrame类型,通过自定义函数运算后得到的DataFrame

    2.3.1.11 dropna ()方法

    函数调用: ddf = df.dropna(axis=0, how=’any’, thresh=None, subset=None, inplace=False)

    函数功能: 根据要求删除带有NaN值的行列

    传入参数: axis, how, thresh, subset, inplace

    axis: int/str类型,搜索方向,0/‘index’为行搜索,1/‘columns’为列搜索

    how: str类型,‘any’只要出现NA值就删除该行/列数据,‘all’所有值都是NA才删除

    thresh: int/None类型,表示对有效数据数量的最小要求(为2则要求该行/列至少2个有效非NA数据存在)

    subset: str/list类型,表示在特定子集中寻找NA

    inplace: bool类型,表示是否在原数据操作,True修改原数据,False返回新数据

    返回参数: cdf

    cdf: DataFrame类型,通过删除NA函数运算后得到的DataFrame

    2.3.1.12 reset_index ()方法

    函数调用: rdf = df.reset_index(level=None, drop=False, inplace=False, col_level=0, col_fill=’’)

    函数功能: 重置(一般为经过排序后的)DataFrame的序号

    传入参数: level, drop, inplace, col_level, col_fill

    level: int/str/tuple/list类型,Only remove the given levels from the index. Removes all levels by default

    drop: bool类型,是否删除原始的index列,True删除,False保留

    inplace: bool类型,是否在原数据上操作

    col_level: int/str类型,If the columns have multiple levels, determines which level the labels are inserted into. By default it is inserted into the first level

    col_fill: obj类型,If the columns have multiple levels, determines how the other levels are named. If None then the index name is repeated.

    返回参数: rdf

    rdf: DataFrame类型,通过重排index后的DataFrame

    2.3.1.13 set_index ()方法

    函数调用: sdf = df.set_index(keys, drop=True, append=False, inplace=False, verify_integrity=False)

    函数功能: 根据现有的columns参数重新设置index索引

    传入参数: keys, drop, append, inplace, verify_integrity

    keys: str类型,需要作为索引的列名

    drop: bool类型,是否删除作为索引的列,True删除,False保留

    append: bool类型,是否添加默认的index(序号索引)

    inplace: bool类型,是否在原数据上操作

    verify_integrity: bool类型,Check the new index for duplicates. Otherwise defer the check until necessary. Setting to False will improve the performance of this method

    返回参数: sdf

    sdf: DataFrame类型,通过重设index后的DataFrame

    2.3.1.14 apply ()方法

    函数调用: re = df.apply(func, axis=0, broadcast=False, raw=False, reduce=None, args=(), **kwds)

    函数功能: 对DataFrame相应的方向使用自定义函数

    传入参数: func, axis, broadcast, raw, reduce, args, **kwds

    func: method类型,用于各行/列的函数

    axis: int/str类型,0/‘index’对每列使用函数,1/‘column’对每行使用函数

    broadcast: bool类型,For aggregation functions, return object of same size with values propagated

    raw: bool类型,If False, convert each row or column into a Series. If raw=True the passed function will receive ndarray objects instead. If you are just applying a NumPy eduction function this will achieve much better performance

    reduce: bool/None类型,Try to apply reduction procedures. If the DataFrame is empty, apply will use reduce to determine whether the result should be a Series or a DataFrame. If reduce is None (the default), apply's return value will be guessed by calling func an empty Series (note: while guessing, exceptions raised by func will be ignored). If reduce is True a Series will always be returned, and if False a DataFrame will always be returned

    args: tuple类型,Positional arguments to pass to function in addition to the array/series

    **kwds: 其余关键字参数将会被当做参数传给调用函数

    返回参数: rdf

    rdf: DataFrame类型,通过重排index后的DataFrame

    2.3.1.15 ix属性

    属性调用: ix_obj = df.ix

    属性功能: 返回一个index类的数据

    属性参数: ix_obj

    ix_obj: obj类型,<class 'pandas.core.indexing._IXIndexer'>

    Note: 后续可通过ix_obj[rows, cols]获取DataFrame或Series,rows/cols可以是需要取的行索引/列名

    2.3.2 Series

    类实例化:sr = pd.Series(data, index=) / df[colomn_name]

    类的功能:用于生成Series

    传入参数: data, index / column_name

    data: ndarray类型,包含需要构建成Series的数据(一维)

    index: Series类型,决定作为索引的列参数

    column_name: str类型,需要获取Series的列名

    返回参数: sr

    sr: Series类型,生成的Series

    2.3.2.1 values属性

    属性调用: values = sr.values

    属性功能: 返回Series的所有value值

    属性参数: values

    values: ndarray类型,Series的所有值形成的一维ndarray

    2.3.2.2 tolist()方法

    函数调用: list =sr.tolist()

    函数功能:将Series或Index类的数据变成list形式返回

    传入参数:

    返回参数: list

    list: list类型,返回的数据列表

    2.3.2.3 max/min()方法

    函数调用: value =sr.max/min()

    函数功能:获取Series中的最大/最小值

    传入参数:

    返回参数: value

    value: int/str等类型,返回的最值

    2.3.2.4 sort_values()方法

    函数调用: ssr = sr.sort_values(axis=0, ascending=True, inplace=False, kind=’quicksort’, na_position=’last’)

    函数功能: 返回按参数排序的Series

    传入参数: axis, ascending, inplace, kind, na_position

    axis: int类型,0按列(第一轴)sort,1按行(最后轴)sort

    ascending: bool类型,True为升序排列, False为降序排列

    inplace: bool类型,True则修改原DataFrame,False则返回新的DataFrame

    kind: str类型,确定sort的排序算法,包括{‘quicksort’, ‘mergesort’, ‘heapsort’}

    na_position: str类型,确定na数据存在的位置,‘first’/‘last’

    返回参数: ssr

    ssr: Series类型,重排后的Series

    2.3.2.5 mean ()方法

    函数调用: msr = sr.mean()

    函数功能: 返回存储所有非NaN的值的平均值Series

    传入参数:

    返回参数: msr

    msr: Series类型,存储均值的数据类型为float

    2.3.2.6 reset_index ()方法

    函数调用: rsr = sr.reset_index(level=None, drop=False, name=None, inplace=False)

    函数功能: 重置(一般为经过排序后的)Series的序号

    传入参数: level, drop, name, inplace

    level: int/str/tuple/list类型,Only remove the given levels from the index. Removes all levels by default

    drop: bool类型,是否删除原始的index列,True删除,False保留

    name: obj类型,The name of the column corresponding to the Series values

    inplace: bool类型,是否在原数据上操作

    返回参数: rsr

    rsr: Series类型,通过重排index后的Series

    2.3.2.7 value_counts ()方法

    函数调用: csr = sr.value_counts(dropna=True)

    函数功能: 计算Series中各个values值的数量

    传入参数: dropna

    dropna: bool类型,是否计算NA的数量,True不计算,False计算

    返回参数: csr

    csr: Series类型,各数据值为索引,数量为value的Series

    3 pandas基本操作

    首先对csv文件进行读取操作,利用read_csv函数,值得注意的是,存储的csv文件必须利用Excel另存为的方式获得,而不能以修改后缀名的方法得到

    1 import pandas as pd
    2 
    3 # info = pd.read_csv('info.csv', encoding='latin1')
    4 # info = pd.read_csv('info.csv', encoding='ISO-8859-1')
    5 # info = pd.read_csv('info.csv', encoding='cp1252')
    6 info = pd.read_csv('info.csv')
    7 # Get the info of whole csv list, and the info of row and column
    8 print(info)

    输出结果为

         No.       Type  Info      Number Rank Mark.
    0   1001   BUTTER_1   NaN    4.000000    A    cc
    1   1002   BUTTER_2   NaN         NaN    C    dd
    2   1003   BUTTER_3   NaN         NaN  NaN    ff
    3   1004   BUTTER_4   NaN         NaN  NaN   NaN
    4   1005   BUTTER_5    df  543.000000    F    cx
    5   1006   BUTTER_6    fa  345.000000    A    cc
    6   1007   BUTTER_7   jhf   67.000000    S    dd
    7   1008   BUTTER_8    ad  567.000000    S    ff
    8   1009   BUTTER_9  gdfs   34.000000    C    aa
    9   1010  BUTTER_10  vczx   34.000000    C    cx
    10  1011  BUTTER_11    as   89.000000    E    cc
    11  1012  BUTTER_12    cd   90.000000    D    dd
    12  1013  BUTTER_13   qwe   14.000000    S    ff
    13  1014    WATER_1   asd  186.635198    A    aa
    14  1015    WATER_2    as  222.000000    B    cc
    15  1016    WATER_3    fa  193.026806    A    cc
    16  1017    WATER_4   jhf  196.222611    C    dd
    17  1018    WATER_5    ad  199.418415    B    ff
    18  1019    WATER_6  gdfs  202.614219    D    aa
    19  1020    WATER_7  vczx  205.810023    F    cx
    20  1021    WATER_8    as  209.005827    A    cc
    21  1022    WATER_9    cd  212.201632    S    dd
    22  1023   WATER_10   qwe  215.397436    S    ff
    23  1024   WATER_11   asd  218.593240    C    aa
    24  1025   WATER_12    df  221.789044    C    cx
    25  1026   WATER_13    fa  224.984848    E    cc
    26  1027   WATER_14   jhf  228.180653    D    dd
    27  1028   WATER_15    ad  231.376457    S    ff
    28  1029   WATER_16  gdfs  234.572261    A    aa
    29  1030   WATER_17  vczx  237.768065    B    cx
    ..   ...        ...   ...         ...  ...   ...
    70  1071  CHEESE_11    as  368.796037    E    cc
    71  1072  CHEESE_12    cd  371.991842    D    dd
    72  1073  CHEESE_13   qwe  375.187646    S    ff
    73  1074  CHEESE_14   asd  378.383450    A    aa
    74  1075  CHEESE_15    df  381.579254    B    cx
    75  1076  CHEESE_16    fa  384.775058    A    cc
    76  1077  CHEESE_17   jhf  387.970863    C    dd
    77  1078  CHEESE_18    ad  391.166667    B    ff
    78  1079  CHEESE_19  gdfs  394.362471    D    aa
    79  1080  CHEESE_20  vczx  397.558275    F    cx
    80  1081  CHEESE_21    as  400.754079    A    cc
    81  1082  CHEESE_22    cd  403.949883    S    dd
    82  1083  CHEESE_23   qwe  407.145688    S    ff
    83  1084  CHEESE_24   asd  410.341492    C    aa
    84  1085  CHEESE_25    df  413.537296    C    cx
    85  1086     MILK_1    fa  416.733100    E    cc
    86  1087     MILK_2   jhf  419.928904    D    dd
    87  1088     MILK_3    ad  423.124709    S    ff
    88  1089     MILK_4  gdfs  426.320513    A    aa
    89  1090     MILK_5  vczx  429.516317    B    cx
    90  1091     MILK_6    as  432.712121    A    cc
    91  1092     MILK_7    cd  435.907925    C    dd
    92  1093     MILK_8   qwe  439.103730    B    ff
    93  1094     MILK_9   asd  442.299534    D    aa
    94  1095    MILK_10    df  445.495338    F    cx
    95  1096    MILK_11    fa  448.691142    A    cc
    96  1097    MILK_12   jhf  451.886946    S    dd
    97  1098    MILK_13    ad  455.082751    S    ff
    98  1099    MILK_14  gdfs  458.278555    C    aa
    99  1100    MILK_15  vczx  461.474359    C    cx
    
    [100 rows x 6 columns]
    View Code

    可以看到,pandas已经将csv文件中的数据成功导入

    接着可以查看导入的数据类型

     1 # Get the type of info
     2 print(type(info))       # <class 'pandas.core.frame.DataFrame'>
     3 print('-----------')
     4 # Get the type of each column(The object dtype equal to the string type in python)
     5 print(info.dtypes)      ''' No.         int64
     6                             Type       object
     7                             Info       object
     8                             Number    float64
     9                             Rank       object
    10                             Mark.      object
    11                             dtype: object '''

    最后还可以利用基本函数获取前/后 n 行,列名信息以及基本描述等

     1 # Get the first x row of csv list, default is 5
     2 print(info.head(7))
     3 print('-----------')
     4 # Get the last x row of csv list, default is 5
     5 print(info.tail(7))
     6 print('-----------')
     7 # Get the name of each column
     8 print(info.columns)
     9 print('-----------')
    10 # Get the shape of csv list
    11 print(info.shape)
    12 print('-----------')
    13 # Get the statistics parameter of cvs list(for digit data)
    14 # Such as count, mean, standard deviation, min, 25%, 50%, 75%, max
    15 print(info.describe())

    输出结果

        No.      Type Info  Number Rank Mark.
    0  1001  BUTTER_1  NaN     4.0    A    cc
    1  1002  BUTTER_2  NaN     NaN    C    dd
    2  1003  BUTTER_3  NaN     NaN  NaN    ff
    3  1004  BUTTER_4  NaN     NaN  NaN   NaN
    4  1005  BUTTER_5   df   543.0    F    cx
    5  1006  BUTTER_6   fa   345.0    A    cc
    6  1007  BUTTER_7  jhf    67.0    S    dd
    -----------
         No.     Type  Info      Number Rank Mark.
    93  1094   MILK_9   asd  442.299534    D    aa
    94  1095  MILK_10    df  445.495338    F    cx
    95  1096  MILK_11    fa  448.691142    A    cc
    96  1097  MILK_12   jhf  451.886946    S    dd
    97  1098  MILK_13    ad  455.082751    S    ff
    98  1099  MILK_14  gdfs  458.278555    C    aa
    99  1100  MILK_15  vczx  461.474359    C    cx
    -----------
    Index(['No.', 'Type', 'Info', 'Number', 'Rank', 'Mark.'], dtype='object')
    -----------
    (100, 6)
    -----------
                   No.      Number
    count   100.000000   97.000000
    mean   1050.500000  309.401389
    std      29.011492  110.975188
    min    1001.000000    4.000000
    25%    1025.750000  240.963869
    50%    1050.500000  317.663170
    75%    1075.250000  391.166667
    max    1100.000000  567.000000
    View Code

    4 pandas计算

    对于pandas,由于其基本结构是基于numpy的ndarray,因此numpy的基本计算操作对于pandas的DataFrame及Series也都适用。

    下面是pandas的一些基本计算方法的示例,

    完整代码

     1 import pandas as pd
     2 
     3 info = pd.read_csv('info.csv')
     4 # Get the certain row of csv list
     5 print(info.loc[0])
     6 print(info.loc[3:7])
     7 print('----------')
     8 # Get certain column(columns) by column name(name list)
     9 print(info['Type'])
    10 print(info[['Type', 'No.']])
    11 # Get the column name and save it as a list
    12 col_names = info.columns.tolist()
    13 print(col_names)
    14 
    15 # Filter off the column name that end with '.'
    16 dotList = []
    17 for n in col_names:
    18     if n.endswith('.'):
    19         dotList.append(n)
    20 newList = info[dotList]
    21 print(newList)
    22 
    23 # Operation for column will act to each element as numpy does
    24 print(info['Number'] * 10)
    25 
    26 # Operation for two csv with same shape will act each corresponding element
    27 x = info['Number']
    28 y = info['No.']
    29 print(x+y)
    30 # Act for string
    31 x = info['Rank']
    32 y = info['Mark.']
    33 print(x+y)
    34 
    35 # Add a column after the tail column(the dimension of new one should be same as origin)
    36 print(info.shape)
    37 info['New'] = x+y
    38 print(info.shape)
    39 print('----------')
    40 
    41 # Get the max/min value of a column
    42 print(info['Number'].max())
    43 print(info['Number'].min())
    44 
    45 num = info['Number']
    46 num_null_true = pd.isnull(num)
    47 # If these is a null value in DataFrame, the calculated result will be NaN
    48 print(sum(info['Number'])/len(info['Number'])) # return nan
    49 # Use the DataFrame == False to reverse the DataFrame
    50 good_value = info['Number'][num_null_true == False]
    51 print(sum(good_value)/len(good_value))
    52 print(good_value.mean())
    53 # mean method can filter off the missing data automatically
    54 print(info['Number'].mean())
    55 print('---------')
    View Code

    分段解释

    首先导入pandas及数据文件,利用loc获取pandas的某行数据,可以使用类似list的切片操作

     1 import pandas as pd
     2 
     3 info = pd.read_csv('info.csv')
     4 # Get the certain row of csv list
     5 print(info.loc[0])
     6 print(info.loc[3:7])
     7 print('----------')
     8 # Get certain column(columns) by column name(name list)
     9 print(info['Type'])
    10 print(info[['Type', 'No.']])

    结果如下,内容较长

    No.           1001
    Type      BUTTER_1
    Info           NaN
    Number           4
    Rank             A
    Mark.           cc
    Name: 0, dtype: object
        No.      Type Info  Number Rank Mark.
    3  1004  BUTTER_4  NaN     NaN  NaN   NaN
    4  1005  BUTTER_5   df   543.0    F    cx
    5  1006  BUTTER_6   fa   345.0    A    cc
    6  1007  BUTTER_7  jhf    67.0    S    dd
    7  1008  BUTTER_8   ad   567.0    S    ff
    ----------
    0      BUTTER_1
    1      BUTTER_2
    2      BUTTER_3
    3      BUTTER_4
    4      BUTTER_5
    5      BUTTER_6
    6      BUTTER_7
    7      BUTTER_8
    8      BUTTER_9
    9     BUTTER_10
    10    BUTTER_11
    11    BUTTER_12
    12    BUTTER_13
    13      WATER_1
    14      WATER_2
    15      WATER_3
    16      WATER_4
    17      WATER_5
    18      WATER_6
    19      WATER_7
    20      WATER_8
    21      WATER_9
    22     WATER_10
    23     WATER_11
    24     WATER_12
    25     WATER_13
    26     WATER_14
    27     WATER_15
    28     WATER_16
    29     WATER_17
            ...    
    70    CHEESE_11
    71    CHEESE_12
    72    CHEESE_13
    73    CHEESE_14
    74    CHEESE_15
    75    CHEESE_16
    76    CHEESE_17
    77    CHEESE_18
    78    CHEESE_19
    79    CHEESE_20
    80    CHEESE_21
    81    CHEESE_22
    82    CHEESE_23
    83    CHEESE_24
    84    CHEESE_25
    85       MILK_1
    86       MILK_2
    87       MILK_3
    88       MILK_4
    89       MILK_5
    90       MILK_6
    91       MILK_7
    92       MILK_8
    93       MILK_9
    94      MILK_10
    95      MILK_11
    96      MILK_12
    97      MILK_13
    98      MILK_14
    99      MILK_15
    Name: Type, Length: 100, dtype: object
             Type   No.
    0    BUTTER_1  1001
    1    BUTTER_2  1002
    2    BUTTER_3  1003
    3    BUTTER_4  1004
    4    BUTTER_5  1005
    5    BUTTER_6  1006
    6    BUTTER_7  1007
    7    BUTTER_8  1008
    8    BUTTER_9  1009
    9   BUTTER_10  1010
    10  BUTTER_11  1011
    11  BUTTER_12  1012
    12  BUTTER_13  1013
    13    WATER_1  1014
    14    WATER_2  1015
    15    WATER_3  1016
    16    WATER_4  1017
    17    WATER_5  1018
    18    WATER_6  1019
    19    WATER_7  1020
    20    WATER_8  1021
    21    WATER_9  1022
    22   WATER_10  1023
    23   WATER_11  1024
    24   WATER_12  1025
    25   WATER_13  1026
    26   WATER_14  1027
    27   WATER_15  1028
    28   WATER_16  1029
    29   WATER_17  1030
    ..        ...   ...
    70  CHEESE_11  1071
    71  CHEESE_12  1072
    72  CHEESE_13  1073
    73  CHEESE_14  1074
    74  CHEESE_15  1075
    75  CHEESE_16  1076
    76  CHEESE_17  1077
    77  CHEESE_18  1078
    78  CHEESE_19  1079
    79  CHEESE_20  1080
    80  CHEESE_21  1081
    81  CHEESE_22  1082
    82  CHEESE_23  1083
    83  CHEESE_24  1084
    84  CHEESE_25  1085
    85     MILK_1  1086
    86     MILK_2  1087
    87     MILK_3  1088
    88     MILK_4  1089
    89     MILK_5  1090
    90     MILK_6  1091
    91     MILK_7  1092
    92     MILK_8  1093
    93     MILK_9  1094
    94    MILK_10  1095
    95    MILK_11  1096
    96    MILK_12  1097
    97    MILK_13  1098
    98    MILK_14  1099
    99    MILK_15  1100
    
    [100 rows x 2 columns]
    View Code

    获取pandas的列名

    1 # Get the column name and save it as a list
    2 col_names = info.columns.tolist()
    3 print(col_names)

    结果如下

    ['No.', 'Type', 'Info', 'Number', 'Rank', 'Mark.']

    过滤出所有以‘.’结尾的列

    1 # Filter off the column name that end with '.'
    2 dotList = []
    3 for n in col_names:
    4     if n.endswith('.'):
    5         dotList.append(n)
    6 newList = info[dotList]
    7 print(newList)

    基本计算操作会作用于pandas的Series每个值

    1 # Operation for column will act to each element as numpy does
    2 print(info['Number'] * 10)

    对两个结构形状相同的Series,其运算会作用到每个values上

    1 # Operation for two csv with same shape will act each corresponding element
    2 x = info['Number']
    3 y = info['No.']
    4 print(x+y)
    5 # Act for string
    6 x = info['Rank']
    7 y = info['Mark.']
    8 print(x+y)

    创建出一个列名为‘New’的新列,值为两个列的值之和

    1 # Add a column after the tail column(the dimension of new one should be same as origin)
    2 print(info.shape)
    3 info['New'] = x+y
    4 print(info.shape)
    5 print('----------')

    获取Series中的最值

    1 # Get the max/min value of a column
    2 print(info['Number'].max())
    3 print(info['Number'].min())

    均值计算的两种方式,

    1. 直接求和平均,当计算中有NaN值时,计算的结果将会为NaN
    2. 利用mean函数进行计算,mean函数将会过自动滤掉NaN缺失数据
     1 num = info['Number']
     2 num_null_true = pd.isnull(num)
     3 # If these is a null value in DataFrame, the calculated result will be NaN
     4 print(sum(info['Number'])/len(info['Number'])) # return nan
     5 # Use the DataFrame == False to reverse the DataFrame
     6 good_value = info['Number'][num_null_true == False]
     7 print(sum(good_value)/len(good_value))
     8 print(good_value.mean())
     9 # mean method can filter off the missing data automatically
    10 print(info['Number'].mean())
    11 print('---------')

    5 pandasSeries

    下面介绍 pandas 中的数据类型 Series 的一些基本使用方法,

    完整代码

     1 import pandas as pd
     2 
     3 info = pd.read_csv('info.csv')
     4 
     5 # Fetch a series from DataFrame
     6 rank_series = info['Rank']
     7 print(type(info)) # <class 'pandas.core.frame.DataFrame'>
     8 print(type(rank_series)) # <class 'pandas.core.series.Series'>
     9 print(rank_series[0:5])
    10 
    11 # New a series
    12 from pandas import Series
    13 # Build a rank series
    14 rank = rank_series.values
    15 print(rank)
    16 # DataFrame --> Series --> ndarray
    17 print(type(rank)) # <class 'numpy.ndarray'>
    18 # Build a type series
    19 type_series = info['Type']
    20 types = type_series.values
    21 # Build a new series based on former two(type and rank)
    22 # Series(values, index=)
    23 series_custom = Series(rank, index=types)
    24 print(series_custom)
    25 # Fetch Series by key name list
    26 print(series_custom[['MILK_14', 'MILK_15']])
    27 # Fetch Series by index
    28 print(series_custom[0:2])
    29 
    30 # Sorted to Series will return a list by sorted value
    31 print(sorted(series_custom, key=lambda x: 0 if isinstance(x, str) else x))
    32 
    33 # Re-sort by index for a Series
    34 original_index = series_custom.index.tolist() 
    35 sorted_index = sorted(original_index)
    36 sorted_by_index = series_custom.reindex(sorted_index)
    37 print(sorted_by_index)
    38 # Series sort function
    39 print(series_custom.sort_index())
    40 print(series_custom.sort_values())
    41 
    42 import numpy as np
    43 # Add operation for Series will add the values for each row(if the dimensions of two series are same)
    44 print(np.add(series_custom, series_custom))
    45 # Apply sin funcion to each value
    46 print(np.sin(info['Number']))
    47 # Return the max value(return a single value not a Series)
    48 # If more than one max value exist, only return one
    49 print(np.max(filter(lambda x: isinstance(x, float), series_custom)))
    50 
    51 # Filter values in range
    52 criteria_one = series_custom > 'C'
    53 criteria_two = series_custom < 'S'
    54 print(series_custom[criteria_one & criteria_two])
    View Code

    分段解释

    利用列名从DataFrame中获取一个Series

    1 import pandas as pd
    2 
    3 info = pd.read_csv('info.csv')
    4 
    5 # Fetch a series from DataFrame
    6 rank_series = info['Rank']
    7 print(type(info)) # <class 'pandas.core.frame.DataFrame'>
    8 print(type(rank_series)) # <class 'pandas.core.series.Series'>
    9 print(rank_series[0:5])

    新建一个Series的方法,先获取一个作为index的列,在获取一个作为values的列,利用Series函数生成新的Series

     1 # New a series
     2 from pandas import Series
     3 # Build a rank series
     4 rank = rank_series.values
     5 print(rank)
     6 # DataFrame --> Series --> ndarray
     7 print(type(rank)) # <class 'numpy.ndarray'>
     8 # Build a type series
     9 type_series = info['Type']
    10 types = type_series.values
    11 # Build a new series based on former two(type and rank)
    12 # Series(values, index=)
    13 series_custom = Series(rank, index=types)
    14 print(series_custom)

    利用列名列表或索引从DataFrame中获取多个Series

    1 # Fetch Series by key name list
    2 print(series_custom[['MILK_14', 'MILK_15']])
    3 # Fetch Series by index
    4 print(series_custom[0:2])

    利用sorted函数根据values大小重排Series,返回值为一个list

    1 # Sorted to Series will return a list by sorted value
    2 print(sorted(series_custom, key=lambda x: 0 if isinstance(x, str) else x))

    两种sort方法对Series进行排列

      1. 获取index索引值,对索引值进行排列,再使用reindex函数获取新的Series

    1 # Re-sort by index for a Series
    2 original_index = series_custom.index.tolist() 
    3 sorted_index = sorted(original_index)
    4 sorted_by_index = series_custom.reindex(sorted_index)
    5 print(sorted_by_index)

      2.使用sort_index或sort_values函数

    1 # Series sort function
    2 print(series_custom.sort_index())
    3 print(series_custom.sort_values())

    Series的相加/正余弦/max,利用numpy函数,将Series的对应values值进行处理

    1 import numpy as np
    2 # Add operation for Series will add the values for each row(if the dimensions of two series are same)
    3 print(np.add(series_custom, series_custom))
    4 # Apply sin funcion to each value
    5 print(np.sin(info['Number']))
    6 # Return the max value(return a single value not a Series)
    7 # If more than one max value exist, only return one
    8 print(np.max(filter(lambda x: isinstance(x, float), series_custom)))

    利用True/False列表获取在范围内满足条件的Series

    1 # Filter values in range
    2 criteria_one = series_custom > 'C'
    3 criteria_two = series_custom < 'S'
    4 print(series_custom[criteria_one & criteria_two])

    6 pandas常用函数

    下面是一些pandas常用的函数示例

    完整代码

      1 import pandas as pd
      2 import numpy as np
      3 
      4 info = pd.read_csv('info.csv')
      5 
      6 # Sort value by column  
      7 # inplace is True will sort value base on origin, False will return a new DataFrame
      8 new = info.sort_values('Mark.', inplace=False, na_position='last')
      9 print(new)
     10 # Sorted by ascending order in default(ascending=True) 
     11 # No matter ascending or descending sort, the NaN(NA, missing value) value will be placed at tail
     12 info.sort_values('Mark.', inplace=True, ascending=False)
     13 print(info)
     14 print('---------')
     15 # Filter off the null row
     16 num = info['Number']
     17 # isnull will return a list contains the status of null or not, True for null, False for not
     18 num_null_true = pd.isnull(num)
     19 print(num_null_true)
     20 num_null = num[num_null_true]
     21 print(num_null) # 12 NaN
     22 print('---------')
     23 
     24 # pivot_table function can calulate certain para that with same attribute group by using certain function
     25 # index tells the method which column to group by
     26 # value is the column that we want to apply the calculation to 
     27 # aggfunc specifies the calculation we want to perform, default function is mean
     28 avg_by_rank = info.pivot_table(index='Rank', values='Number', aggfunc=np.sum)
     29 print(avg_by_rank)
     30 print('---------')
     31 # Operate to multi column
     32 sum_by_rank = info.pivot_table(index='Rank', values=['Number', 'No.'], aggfunc=np.sum)
     33 print(sum_by_rank)
     34 print('---------')
     35 
     36 # dropna function can drop any row/columns that have null values
     37 info = pd.read_csv('info.csv')
     38 # Drop the columns that contain NaN (axis=0 for row)
     39 drop_na_column = info.dropna(axis=1)
     40 print(drop_na_column)
     41 print('---------')
     42 # Drop the row that subset certains has NaN 
     43 # thresh to decide how many valid value required
     44 drop_na_row = info.dropna(axis=0, thresh=1, subset=['Number', 'Info', 'Rank', 'Mark.'])
     45 print(drop_na_row)
     46 print('---------')
     47 # Locate to a certain value by its row number(plus 1 for No.) and column name
     48 print(info)
     49 row_77_Rank = info.loc[77, 'Rank']
     50 print(row_77_Rank)
     51 row_88_Info = info.loc[88, 'Info']
     52 print(row_88_Info)
     53 print('---------')
     54 
     55 # reset_index can reset the index for sorted DataFrame
     56 new_info = info.sort_values('Rank', ascending=False)
     57 print(new_info[0:10])
     58 print('---------')
     59 # drop=True will drop the index column, otherwise will keep former index colunn (default False)
     60 reset_new_info = new_info.reset_index(drop=True)
     61 print(reset_new_info[0:10])
     62 print('---------')
     63 
     64 # Define your own function for pandas
     65 # Use apply function to implement your own function
     66 def hundredth_row(col):
     67     hundredth_item = col.loc[99]
     68     return hundredth_item 
     69 hundred_row = info.apply(hundredth_row, axis=0)
     70 print(hundred_row)
     71 print('---------')
     72 # Null count
     73 # The apply function will act to each column
     74 def null_count(column):
     75     column_null = pd.isnull(column)
     76     null = column[column_null]
     77     return len(null)
     78 # Passing in axis para 0 to iterate over rows instead of column
     79 # Note: 0 for act by row but passing by column, 1 for act by column but passing by row
     80 # Passing by column can act for each column then get row
     81 # Passing by row can act for each row than get column
     82 column_null_count = info.apply(null_count, axis=0)
     83 print(column_null_count)
     84 print('---------')
     85 
     86 # Example: classify the data by Rank, and calculate the sum for each
     87 def rank_sort(row):
     88     rank = row['Rank']
     89     if rank == 'S':
     90         return 'Excellent'
     91     elif rank == 'A':
     92         return 'Great'
     93     elif rank == 'B':
     94         return 'Good'
     95     elif rank == 'C':
     96         return 'Pass'
     97     else:
     98         return 'Failed'
     99 # Format a classified column
    100 rank_info = info.apply(rank_sort, axis=1)
    101 print(rank_info)
    102 print('---------')
    103 # Add the column to DataFrame
    104 info['Rank_Classfied'] = rank_info
    105 # Calculate the sum of 'Number' according to 'Rank_Classfied'
    106 new_rank_number = info.pivot_table(index='Rank_Classfied', values='Number', aggfunc=np.sum)
    107 print(new_rank_number)
    108 
    109 # set_index will return a new DataFrame that is indexed by values in the specified column
    110 # And will drop that column(default is True)
    111 # The column set to be index will not be dropped if drop=False
    112 index_type = info.set_index('Type', drop=False, append=True)
    113 print(index_type)
    114 print('---------')
    115 
    116 # Use string index to slice the DataFrame
    117 # Note: the index(key) should be unique
    118 print(index_type['MILK_1':'MILK_7'])
    119 print('---------')
    120 print(index_type.loc['MILK_1':'MILK_7'])
    121 # Value index is available too
    122 print('---------')
    123 print(index_type[-15:-8])
    124 print('---------')
    125 
    126 # Calculate the standard deviation for each element from two different index
    127 cal_list = info[['Number', 'No.']]
    128 # np.std([x, y]) --> std value
    129 # The lambda x is a Series
    130 # cal_list.apply(lambda x: print(type(x)), axis=1)
    131 print(cal_list.apply(lambda x: np.std(x), axis=1))
    View Code

    分段解释

    首先导入模块,然后利用sort_values函数对DataFrame或Series进行排序操作

     1 mport pandas as pd
     2 import numpy as np
     3 
     4 info = pd.read_csv('info.csv')
     5 
     6 # Sort value by column  
     7 # inplace is True will sort value base on origin, False will return a new DataFrame
     8 new = info.sort_values('Mark.', inplace=False, na_position='last')
     9 print(new)
    10 # Sorted by ascending order in default(ascending=True) 
    11 # No matter ascending or descending sort, the NaN(NA, missing value) value will be placed at tail
    12 info.sort_values('Mark.', inplace=True, ascending=False)
    13 print(info)
    14 print('---------')

    利用isnull函数对null值的数据进行过滤,可利用Series==False对isnull得到的序列进行反转

    1 # Filter off the null row
    2 num = info['Number']
    3 # isnull will return a list contains the status of null or not, True for null, False for not
    4 num_null_true = pd.isnull(num)
    5 print(num_null_true)
    6 num_null = num[num_null_true]
    7 print(num_null) # 12 NaN
    8 print('---------')

    利用pivot_table函数对相同属性分组的数据进行指定函数的计算

     1 # pivot_table function can calulate certain para that with same attribute group by using certain function
     2 # index tells the method which column to group by
     3 # value is the column that we want to apply the calculation to 
     4 # aggfunc specifies the calculation we want to perform, default function is mean
     5 avg_by_rank = info.pivot_table(index='Rank', values='Number', aggfunc=np.sum)
     6 print(avg_by_rank)
     7 print('---------')
     8 # Operate to multi column
     9 sum_by_rank = info.pivot_table(index='Rank', values=['Number', 'No.'], aggfunc=np.sum)
    10 print(sum_by_rank)
    11 print('---------')

    利用dropna函数删除空值数据

     1 # dropna function can drop any row/columns that have null values
     2 info = pd.read_csv('info.csv')
     3 # Drop the columns that contain NaN (axis=0 for row)
     4 drop_na_column = info.dropna(axis=1)
     5 print(drop_na_column)
     6 print('---------')
     7 # Drop the row that subset certains has NaN 
     8 # thresh to decide how many valid value required
     9 drop_na_row = info.dropna(axis=0, thresh=1, subset=['Number', 'Info', 'Rank', 'Mark.'])
    10 print(drop_na_row)
    11 print('---------')

    利用loc对数据进行定位

    1 # Locate to a certain value by its row number(plus 1 for No.) and column name
    2 print(info)
    3 row_77_Rank = info.loc[77, 'Rank']
    4 print(row_77_Rank)
    5 row_88_Info = info.loc[88, 'Info']
    6 print(row_88_Info)
    7 print('---------')

    利用reset_index函数对索引进行重排

    1 # reset_index can reset the index for sorted DataFrame
    2 new_info = info.sort_values('Rank', ascending=False)
    3 print(new_info[0:10])
    4 print('---------')
    5 # drop=True will drop the index column, otherwise will keep former index colunn (default False)
    6 reset_new_info = new_info.reset_index(drop=True)
    7 print(reset_new_info[0:10])
    8 print('---------')

    利用apply函数运行自定义函数

     1 # Define your own function for pandas
     2 # Use apply function to implement your own function
     3 def hundredth_row(col):
     4     hundredth_item = col.loc[99]
     5     return hundredth_item 
     6 hundred_row = info.apply(hundredth_row, axis=0)
     7 print(hundred_row)
     8 print('---------')
     9 # Null count
    10 # The apply function will act to each column
    11 def null_count(column):
    12     column_null = pd.isnull(column)
    13     null = column[column_null]
    14     return len(null)
    15 # Passing in axis para 0 to iterate over rows instead of column
    16 # Note: 0 for act by row but passing by column, 1 for act by column but passing by row
    17 # Passing by column can act for each column then get row
    18 # Passing by row can act for each row than get column
    19 column_null_count = info.apply(null_count, axis=0)
    20 print(column_null_count)
    21 print('---------')
    22 
    23 # Example: classify the data by Rank, and calculate the sum for each
    24 def rank_sort(row):
    25     rank = row['Rank']
    26     if rank == 'S':
    27         return 'Excellent'
    28     elif rank == 'A':
    29         return 'Great'
    30     elif rank == 'B':
    31         return 'Good'
    32     elif rank == 'C':
    33         return 'Pass'
    34     else:
    35         return 'Failed'
    36 # Format a classified column
    37 rank_info = info.apply(rank_sort, axis=1)
    38 print(rank_info)
    39 print('---------')

    添加一个column到DataFrame并进行计算处理

    1 # Add the column to DataFrame
    2 info['Rank_Classfied'] = rank_info
    3 # Calculate the sum of 'Number' according to 'Rank_Classfied'
    4 new_rank_number = info.pivot_table(index='Rank_Classfied', values='Number', aggfunc=np.sum)
    5 print(new_rank_number)

    利用set_index函数设置新的索引,利用索引进行切片操作,切片如果是列名字符串,将返回两个列名索引之间所有的数据

     1 # set_index will return a new DataFrame that is indexed by values in the specified column
     2 # And will drop that column(default is True)
     3 # The column set to be index will not be dropped if drop=False
     4 index_type = info.set_index('Type', drop=False, append=True)
     5 print(index_type)
     6 print('---------')
     7 
     8 # Use string index to slice the DataFrame
     9 # Note: the index(key) should be unique
    10 print(index_type['MILK_1':'MILK_7'])
    11 print('---------')
    12 print(index_type.loc['MILK_1':'MILK_7'])
    13 # Value index is available too
    14 print('---------')
    15 print(index_type[-15:-8])
    16 print('---------')

    对两个不同索引内的元素分别进行标准差计算

    1 # Calculate the standard deviation for each element from two different index
    2 cal_list = info[['Number', 'No.']]
    3 # np.std([x, y]) --> std value
    4 # The lambda x is a Series
    5 # cal_list.apply(lambda x: print(type(x)), axis=1)
    6 print(cal_list.apply(lambda x: np.std(x), axis=1))

    补充内容 / Complement

    1. pandas许多函数底层是基于numpy进行的,pandas一个函数可能调用了numpy的多个函数进行实现;

    2. object dtype 和 Python中的string相同;

    3. pandas中如果不指定列名则默认文件中第一行为列名;

    4. 基本结构包括DataFrame和Series,DataFrame可以分解为Series,DataFrame是由一系列的Series组成的,DataFrame相当于矩阵,Series相当于行或者列。

    相关阅读


    1. numpy 的使用

  • 相关阅读:
    hadoop 2.5 hdfs namenode –format 出错Usage: java NameNode [-backup] |
    自己动手编译hadoop-2.5.2源码
    CentOS Linux解决Device eth0 does not seem to be present
    Liz Murray成功故事的偶然与必然(转)
    【BZOJ4242】水壶(克鲁斯卡尔重构树,BFS)
    【BZOJ3551】Peaks加强版(Kruskal重构树,主席树)
    【agc023E】Inversions(线段树,动态规划)
    【CF183D】T-shirt(动态规划,贪心)
    【BZOJ2423】最长公共子序列(动态规划)
    【BZOJ2118】墨墨的等式(最短路)
  • 原文地址:https://www.cnblogs.com/stacklike/p/8260758.html
Copyright © 2011-2022 走看看