zoukankan      html  css  js  c++  java
  • Pandas汇总和处理缺失数据

    汇总的函数

    方法 说明
    count 非NA的值数量
    describe 针对Series和DataFrame列计算汇总统计
    min、max 计算最小值和最大值
    argmin、argmax 计算能够获取到最小值和最大值的索引位置
    idxmin、indxmax 计算能够获取到最小值和最大值的索引值
    quantile 计算四分位数
    sum 值的总和
    mean 值的平均数
    median 值的算术中位数(第50百分位数)
    mad 根据平均值计算平均绝对离差
    var 样本值的方差
    std 样本值的标准差
    skew 样本值的偏度
    kurt 样本值的峰度
    cumsum 样本值的累计和
    cummin、cummax 累计最大值和累计最小值
    cumprod 累计积
    diff 计算一阶差分(对时间序列很有用)
    pct_change 计算百分数变化
    import numpy as np
    import pandas as pd
    from pandas import Series
    
    df = pd.DataFrame([[1.4, np.nan],[7.1,-4.5],[np.nan,np.nan],[0.75,-1.3]],
                     index=['a','b','c','d'],
                      columns = ['one','two']
                     )
    	one	two
    a	1.40	NaN
    b	7.10	-4.5
    c	NaN	NaN
    d	0.75	-1.3
    # 求每一列和
    df.sum()
    
    one    9.25
    two   -5.80
    dtype: float64
    # 求每一行和
    df.sum(axis=1)
    
    a    1.40
    b    2.60
    c    0.00
    d   -0.55
    dtype: float64
    
    选项 说明
    axis 约简的轴,axis=1代表统计一行的值,默认统计一列的值。
    skipna 排除缺失值,默认值为True
    level 如果轴是层次化索引的,则根据level分组约简
    # 返回最大值最小值的索引
    df.idxmin()
    one    d
    two    b
    dtype: object
    
    # 累积列的和
    df.cumsum()
    ### 汇总分析
    df.describe()
    
    
    	    one	        two
    count	3.000000	2.000000
    mean	3.083333	-2.900000
    std	3.493685	2.262742
    min	0.750000	-4.500000
    25%	1.075000	-3.700000
    50%	1.400000	-2.900000
    75%	4.250000	-2.100000
    max	7.100000	-1.300000
    
    # 对于非数值型数据,describe会产生另一种汇总统计
    obj = Series(['a','a','b','c']*4)
    obj.describe()
    
    count     16
    unique     3
    top        a
    freq       8
    dtype: object
    

    相关系数和协方差

    # 需要先pip install pandas_datareader
    import pandas_datareader.data as web
    
    # 远程获取股票的数据
    all_data={}
    for ticker in ['AAPL','IBM','MSFT']:
        all_data[ticker]=web.get_data_yahoo(ticker,'1/1/2000','1/1/2010')
    all_data
    
    {'AAPL':                  High        Low       Open      Close       Volume  Adj Close
     Date                                                                          
     1999-12-31   3.674107   3.553571   3.604911   3.671875   40952800.0   2.467498
     2000-01-03   4.017857   3.631696   3.745536   3.997768  133949200.0   2.686497
     2000-01-04   3.950893   3.613839   3.866071   3.660714  128094400.0   2.459998
     2000-01-05   3.948661   3.678571   3.705357   3.714286  194580400.0   2.495997
     2000-01-06   3.821429   3.392857   3.790179   3.392857  191993200.0   2.279998
     2000-01-07   3.607143   3.410714   3.446429   3.553571  115183600.0   2.387998
     2000-01-10   3.651786   3.383929   3.642857   3.491071  126266000.0   2.345998
     2000-01-11   3.549107   3.232143   3.426339   3.312500  110387200.0   2.225998
     2000-01-12   3.410714   3.089286   3.392857   3.113839  244017200.0   2.092498
     2000-01-13   3.526786   3.303571   3.374439   3.455357  258171200.0   2.321998
     2000-01-14   3.651786   3.549107   3.571429   3.587054   97594000.0   2.410497
     2000-01-18   3.785714   3.587054   3.607143   3.712054  114794400.0   2.494497
     2000-01-19   3.883929   3.691964   3.772321   3.805804  149410800.0   2.557498
     2000-01-20   4.339286   4.053571   4.125000   4.053571  457783200.0   2.723998
     2000-01-21   4.080357   3.935268   4.080357   3.975446  123981200.0   2.671498
     2000-01-24   4.026786   3.754464   3.872768   3.794643  110219200.0   2.549998
     2000-01-25   4.040179   3.656250   3.750000   4.008929  124286400.0   2.693998
     2000-01-26   4.078125   3.919643   3.928571   3.935268   91789600.0   2.644498
     2000-01-27   4.035714   3.821429   3.886161   3.928571   85036000.0   2.639997
     2000-01-28   3.959821   3.593750   3.863839   3.629464  105837200.0   2.438998
     2000-01-31   3.709821   3.375000   3.607143   3.705357  175420000.0   2.489997
     2000-02-01   3.750000   3.571429   3.714286   3.580357   79508800.0   2.405998
     2000-02-02   3.647321   3.464286   3.598214   3.529018  116048800.0   2.371498
     2000-02-03   3.723214   3.580357   3.582589   3.689732  118798400.0   2.479497
     2000-02-04   3.928571   3.700893   3.712054   3.857143  106330000.0   2.591997
     2000-02-07   4.080357   3.783482   3.857143   4.073661  110266800.0   2.737498
     2000-02-08   4.147321   3.973214   4.071429   4.102679  102160800.0   2.756998
     2000-02-09   4.183036   4.015625   4.075893   4.022321   74841200.0   2.702997
     2000-02-10   4.066964   3.928571   4.031250   4.053571   75745600.0   2.723998
     2000-02-11   4.075893   3.866071   4.058036   3.883929   53062800.0   2.609998
     ...               ...        ...        ...        ...          ...        ...
     2009-11-18  29.571428  29.142857  29.505714  29.422857   93580200.0  19.772139
     2009-11-19  29.230000  28.542856  29.230000  28.644285  135581600.0  19.248940
     2009-11-20  28.627142  28.251429  28.450001  28.559999  101666600.0  19.192299
     2009-11-23  29.428572  28.992857  29.000000  29.411428  118724200.0  19.764460
     2009-11-24  29.411428  28.985714  29.332857  29.205715   79609600.0  19.626223
     2009-11-25  29.378571  29.108572  29.342857  29.170000   71613500.0  19.602222
     2009-11-27  28.994286  28.338572  28.459999  28.655714   73814300.0  19.256622
     2009-11-30  28.811428  28.395714  28.730000  28.558571  106214500.0  19.191339
     2009-12-01  28.967142  28.118572  28.891428  28.138571  116440800.0  18.909100
     2009-12-02  28.774286  27.964285  28.422857  28.032858  178815000.0  18.838060
     2009-12-03  28.425714  28.038572  28.202858  28.068571  112179900.0  18.862062
     2009-12-04  28.554285  27.182858  28.528572  27.617144  206721200.0  18.558701
     2009-12-07  27.681429  26.954287  27.617144  26.992857  178689700.0  18.139185
     2009-12-08  27.478571  26.957144  27.051428  27.124287  172599700.0  18.227499
     2009-12-09  28.308571  27.187143  27.325714  28.257143  171195500.0  18.988783
     2009-12-10  28.528572  28.017143  28.500000  28.061428  122417400.0  18.857262
     2009-12-11  28.285715  27.632856  28.254286  27.809999  107443700.0  18.688299
     2009-12-14  28.204287  27.508572  27.910000  28.139999  123947600.0  18.910061
     2009-12-15  28.215714  27.610001  27.975714  27.738571  104864900.0  18.640306
     2009-12-16  28.071428  27.792856  27.871429  27.861429   88246200.0  18.722862
     2009-12-17  27.857143  27.285715  27.751429  27.408571   97209700.0  18.418543
     2009-12-18  27.928572  27.514286  27.595715  27.918571  152192600.0  18.761261
     2009-12-21  28.535715  27.952858  28.007143  28.318571  152976600.0  19.030060
     2009-12-22  28.692858  28.379999  28.491428  28.622858   87378900.0  19.234541
     2009-12-23  28.911428  28.687143  28.742857  28.871429   86381400.0  19.401581
     2009-12-24  29.907143  29.049999  29.078571  29.862858  125222300.0  20.067825
     2009-12-28  30.564285  29.944286  30.245714  30.230000  161141400.0  20.314537
     2009-12-29  30.388571  29.818571  30.375713  29.871429  111301400.0  20.073582
     2009-12-30  30.285715  29.758572  29.832857  30.234285  103021100.0  20.317423
     2009-12-31  30.478571  30.080000  30.447144  30.104286   88102700.0  20.230061
     
     [2516 rows x 6 columns],
    
    
     price = pd.DataFrame({tic:data['Adj Close'] for tic, data in all_data.items()})
    volume = pd.DataFrame({tic:data['Volume'] for tic, data in all_data.items()})
    
    # 计算价格的百分位数变化
    returns = price.pct_change()
    # 结尾取5个,取局部的
    returns.tail()
    
                    AAPL	        IBM	              MSFT
        Date			
    2009-12-24	0.034340	0.004385	0.002587
    2009-12-28	0.012294	0.013326	0.005484
    2009-12-29	-0.011861	-0.003477	0.007058
    2009-12-30	0.012147	0.005461	-0.013698
    2009-12-31	-0.004300	-0.012597	-0.015504
    

    Series的corr方法用于计算两个Serires重叠的、非NA的、按索引对齐的值的相关系数。COV用于计算协方差。

    # 计算MSFT和IBM的相关系数
    returns.MSFT.corr(returns.IBM)
    0.49253706924390156
    
    # 计算MSFT和IBM的协方差
    returns.MSFT.cov(returns.IBM)
    0.00021557776646297303
    
    DataFrame的corr和cov方法,可以统计出任意两者之间的相关系数和协方差
    returns.corr()
    
                AAPL	IBM	        MSFT
    AAPL	1.000000	0.412392	0.422852
    IBM	0.412392	1.000000	0.492537
    MSFT	0.422852	0.492537	1.000000
    
    # 返回单列的IBM与另外三者之间的相关系数
    returns.corrwith(returns.IBM)
    AAPL    0.412392
    IBM     1.000000
    MSFT    0.492537
    dtype: float64
    
    # 百分比变化和成交量的相关系数
    
    returns.corrwith(volume)
    AAPL   -0.057665
    IBM    -0.006592
    MSFT   -0.016101
    dtype: float64
    
    returns.cov()
    
                AAPL	IBM	        MSFT
    AAPL	0.001030	0.000254	0.000309
    IBM	0.000254	0.000369	0.000216
    MSFT	0.000309	0.000216	0.000519
    

    唯一值(unique),返回一个Series,其索引为唯一值,其值为频率

    obj = Series(['c','a','d','a','a','b','b','c','c'])
    uniques = obj.unique()
    uniques
    
    array(['c', 'a', 'd', 'b'], dtype=object)
    
    # 对返回的结果可以进行排序
    uniques.sort()
    uniques
    
    array(['a', 'b', 'c', 'd'], dtype=object)
    
    # 统计值出现的次数,为了方便查看按值频率降序排序
    obj.value_counts()
    
    a    3
    c    3
    b    2
    d    1
    dtype: int64
    
    # pandas中需要将一维数组传入进去
    pd.value_counts(obj.values)
    
    a    3
    c    3
    b    2
    d    1
    dtype: int64
    

    成员资格isin 计算一个表示"是否在数组中"布尔型数组

    mask = obj.isin(['b','c'])
    # 第一步先选出是否在,返回bool
    mask
    
    0     True
    1    False
    2    False
    3    False
    4    False
    5     True
    6     True
    7     True
    8     True
    dtype: bool
    
    # 选择真值
    obj[mask]
    
    0    c
    5    b
    6    b
    7    c
    8    c
    dtype: object
    
    
    data = pd.DataFrame({'Qu1':[100,300,400,300,400],
                        'Qu2':[200,300,100,200,300],
                         'Qu3':[100,500,200,400,400]
                        })
    data
    
            Qu1	Qu2	Qu3
    0	100	200	100
    1	300	300	500
    2	400	100	200
    3	300	200	400
    4	400	300	400
    
    p=data.apply(pd.value_counts).fillna(0)
    p
    
            Qu1	Qu2	Qu3
    100	1.0	1.0	1.0
    200	0.0	2.0	1.0
    300	2.0	2.0	0.0
    400	2.0	0.0	2.0
    500	0.0	0.0	1.0
    

    处理缺失数据

    NA处理方法

    方法 说明
    dropna 根据各标签的值中是否存在缺失数据对轴标签进行过滤,可通过阈值调节对缺失值的容忍度
    fillna 用指定值或插值方法填充缺失数据
    isnull 返回一个含有布尔值的对象,这些布尔值表示哪些值是缺失值
    notnull isnull的否定式

    过滤掉

    from numpy import nan as NA
    data = pd.DataFrame([[1,6.5,3],[1,NA,NA],[NA,NA,NA],[NA,6.5,3]])
    data
    
            0	1	2
    0	1.0	6.5	3.0
    1	1.0	NaN	NaN
    2	NaN	NaN	NaN
    3	NaN	6.5	3.0
    
    # 默认会把所有的有NA的行删除
    data.dropna()
            0	1	2
    0	1.0	6.5	3.0
    
    # 传入how='all'将只丢弃全为NA的行
    data.dropna(how='all')
    
            0	1	2
    0	1.0	6.5	3.0
    1	1.0	NaN	NaN
    3	NaN	6.5	3.0
    
    data[4]=NA
    data
            0	1	2	4
    0	1.0	6.5	3.0	NaN
    1	1.0	NaN	NaN	NaN
    2	NaN	NaN	NaN	NaN
    3	NaN	6.5	3.0	NaN
    
    # axis=1 删除全部为NA的列
    data.dropna(axis=1,how='all')
    
            0	1	2
    0	1.0	6.5	3.0
    1	1.0	NaN	NaN
    2	NaN	NaN	NaN
    3	NaN	6.5	3.0
    
    df = pd.DataFrame(np.random.randn(7,5))
    df.ix[:4,1] = NA
    df.ix[:2,2] = NA
    
    
            0	        1	        2	                3	                4
    0	0.758567	NaN	        NaN	                -0.064858	    -0.385678
    1	-0.275586	NaN	        NaN	                -0.184934	    -0.253343
    2	-1.872585	NaN	        NaN	                -1.539924	    0.794054
    3	1.092201	NaN	        0.250026	        -0.654255	    -0.016992
    4	0.625871	NaN	        -1.418505	        1.141008	    3.188408
    5	-0.714581	0.423811	-0.799328	        -1.010573	    -0.959299
    6	0.887836	1.412723	-0.405043	        -0.417018	    -1.114318
    
    
    #这里的thresh函数是选取最少non-NA值个数的行选出来
    df.dropna(thresh=5)
    
            0	        1	           2	       3	            4
    5	-0.714581	0.423811	-0.799328	-1.010573	-0.959299
    6	0.887836	1.412723	-0.405043	-0.417018	-1.114318
    
    
    #### 填充缺失数据,如果不想过滤的话
    df.fillna(0)
    # 字典调用fillna,对不同的列填充不同的值
    df.fillna({1:'第一列',2:'第二列'})
    
    
            0	           1	     2	          3	           4
    0	0.758567	第一列	    第二列	 -0.064858	  -0.385678
    1	-0.275586	第一列	    第二列	  -0.184934	  -0.253343
    2	-1.872585	第一列	    第二列	  -1.539924	  0.794054
    3	1.092201	第一列	    0.250026	  -0.654255	  -0.016992
    4	0.625871	第一列	    -1.41851	  1.141008	  3.188408
    5	-0.714581	0.423811   -0.799328	  -1.010573	  -0.959299
    6	0.887836	1.41272	   -0.405043	  -0.417018	  -1.114318
    
    # 上面的都是返回值为新的对象,如果直接向对原数据修改,inplace = True
    df.fillna({1:'第一列',2:'第二列'},inplace=True)
    
    # 向上,向下填充(bfill,ffill)
    df = pd.DataFrame(np.random.randn(6,3))
    
                0	         1	            2
    0	2.040458	-2.276695	-1.038916
    1	0.427078	NaN	        0.001678
    2	1.798042	NaN	        -0.839205
    3	-0.433214	-0.312427	NaN
    4	0.041802	1.356883	NaN
    5	-0.904390	-1.030643	1.507598
    
    df.fillna(method='ffill',limit=1)
    
              0	         1	            2
    0	2.040458	-2.276695	-1.038916
    1	0.427078	-2.276695	 0.001678
    2	1.798042	NaN	        -0.839205
    3	-0.433214	-0.312427	-0.839205
    4	0.041802	1.356883	NaN
    5	-0.904390	-1.030643	1.507598
    
    
    #还可以传入Series的平均值或中位数
    data = Series([1,NA,3.5])
    data.fillna(data.mean())
    
    0    1.00
    1    2.25
    2    3.50
    dtype: float64
    
  • 相关阅读:
    线程同步的方法
    为什么HashMap中key是引用类型而不是基本数据类型?为什么有了基本数据类型还有包装类型?
    使用MyBatis的mapper接口(动态代理对象)调用时的注意点
    redis的aof持久化模式
    redis的RDB持久化方式的优缺点
    快排算法
    JAVA8新特性
    NIO中Buffer的capacity,position和limit含义
    ArrayBlockingQueue与LinkedBlockingQueue对比
    写加锁但读没有加锁造成的脏读问题
  • 原文地址:https://www.cnblogs.com/lishi-jie/p/9922402.html
Copyright © 2011-2022 走看看