  • Data Cleaning and Preparation

    1. Sampling:

    import numpy as np
    import pandas as pd
    
    choices = pd.Series([5, 7, -1, 6, 4])
    # Draw 10 values with replacement; the result differs on every run
    draws = choices.sample(n=10, replace=True)
    draws
    

    Out:

    0    5
    1    7
    3    6
    2   -1
    4    4
    4    4
    4    4
    2   -1
    3    6
    2   -1
    dtype: int64
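
Because the draws above change on every run, a reproducible variant can help when checking results. The following is a minimal sketch, assuming a fixed seed is acceptable; the seed value 42 and the names draws_a and draws_b are illustrative and not part of the original post.

```python
import pandas as pd

choices = pd.Series([5, 7, -1, 6, 4])

# With replace=True the same element may be drawn more than once,
# so n is allowed to exceed the length of the Series.
draws_a = choices.sample(n=10, replace=True, random_state=42)
draws_b = choices.sample(n=10, replace=True, random_state=42)

# The same seed gives the same draws.
print(draws_a.equals(draws_b))

# Without replacement, asking for more items than exist raises ValueError.
try:
    choices.sample(n=10)
except ValueError as exc:
    print("ValueError:", exc)
```

Passing random_state is optional; leaving it out gives fresh random draws each time, as in the example above.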

    2. Splitting:

    x = 'a|b|c'
    x.split('|')

    Out:

    ['a', 'b', 'c']
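
The same split can be applied to a whole column at once through the pandas string accessor. This is a small sketch rather than part of the original post; the Series s and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical column of pipe-delimited strings
s = pd.Series(['a|b|c', 'd|e|f'])

# One list per row
as_lists = s.str.split('|')
print(as_lists)

# expand=True spreads the pieces into separate columns
parts = s.str.split('|', expand=True)
print(parts)
```

With expand=True the result is a DataFrame with one column per piece, which is usually easier to clean further than a column of lists.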

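    3. Unique values:

    l1 = ['a', 'a', 'c', 'b', 'b', 'c', 'c']
    # Wrap the list in a Series: recent pandas versions deprecate passing a plain list to pd.unique
    pd.unique(pd.Series(l1))

    Out:

    array(['a', 'c', 'b'], dtype=object)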
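
Note that the unique values come back in order of first appearance, not sorted. The sketch below shows how one might sort them or count occurrences; the names s, uniques, sorted_uniques and counts are illustrative.

```python
import pandas as pd

l1 = ['a', 'a', 'c', 'b', 'b', 'c', 'c']
s = pd.Series(l1)

# Unique values, in order of first appearance
uniques = s.unique()
print(list(uniques))

# Sort explicitly if a sorted result is needed
sorted_uniques = sorted(uniques)
print(sorted_uniques)

# value_counts reports how often each value occurs
counts = s.value_counts()
print(counts)
```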

    4. Selecting values by index:

    data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                        index=['one', 'two', 'three', 'four'],
                        columns=['a', 'b', 'c', 'd'])
    data

    Out:

           a   b   c   d
    one    0   1   2   3
    two    4   5   6   7
    three  8   9  10  11
    four  12  13  14  15

    # get_indexer maps column labels to their integer positions
    data.columns.get_indexer(['c', 'a', 'b'])

    Out:

    array([2, 0, 1])

    # Assign 88 to columns c, a and b of the second row (position 1)
    data.iloc[1, data.columns.get_indexer(['c', 'a', 'b'])] = 88
    data

    Out:

           a   b   c   d
    one    0   1   2   3
    two   88  88  88   7
    three  8   9  10  11
    four  12  13  14  15

    # Positional selection: first two rows, columns in the order c, a, b
    value = data.iloc[:2, data.columns.get_indexer(['c', 'a', 'b'])]
    value

    Out:

          c   a   b
    one   2   0   1
    two  88  88  88

    # Label-based selection gives the same result
    value2 = data.loc[['one', 'two'], ['c', 'a', 'b']]
    value2

    Out:

          c   a   b
    one   2   0   1
    two  88  88  88
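
One detail worth keeping in mind with get_indexer: a label that does not exist is reported as -1 rather than raising an error, and -1 passed to iloc would silently select the last column. The sketch below guards against that; the label 'z' and the names wanted, positions, found and subset are hypothetical.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['one', 'two', 'three', 'four'],
                    columns=['a', 'b', 'c', 'd'])

# 'z' is a hypothetical label that is not a column of data
wanted = ['c', 'z', 'a']
positions = data.columns.get_indexer(wanted)
print(positions)  # missing labels show up as -1

# Keep only positions that were actually found before using iloc;
# otherwise -1 would pick the last column by accident.
found = positions[positions >= 0]
subset = data.iloc[:2, found]
print(subset)
```

Using data.loc with the label list instead would raise a KeyError for the missing label, which may be the preferable behaviour when a missing column is a real error.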

    5. Filtering rows and columns:

    data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                        index=['one', 'two', 'three', 'four'],
                        columns=['a', 'b', 'c', 'd'])
    data

    Out:

           a   b   c   d
    one    0   1   2   3
    two    4   5   6   7
    three  8   9  10  11
    four  12  13  14  15

    # Element-wise comparison yields a boolean DataFrame
    data > 5

    Out:

               a      b      c      d
    one    False  False  False  False
    two    False  False   True   True
    three   True   True   True   True
    four    True   True   True   True

    # Values that fail the condition become NaN, so the columns are upcast to float
    data[data > 5]

    Out:

              a     b     c     d
    one     NaN   NaN   NaN   NaN
    two     NaN   NaN   6.0   7.0
    three   8.0   9.0  10.0  11.0
    four   12.0  13.0  14.0  15.0

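    # Along axis 1: keep the rows in which at least one value is greater than 5.
    # The axis is passed by keyword; recent pandas versions no longer accept any(1).
    data[(data > 5).any(axis=1)]

    Out:

           a   b   c   d
    two    4   5   6   7
    three  8   9  10  11
    four  12  13  14  15
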
    # Along axis 0: for each column, whether it contains any value greater than 5
    (data > 5).any(axis=0)

    Out:

    a    True
    b    True
    c    True
    d    True
    dtype: bool

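    # Keep the columns in which at least one value is greater than 5.
    # Every column qualifies here, so the whole frame is returned.
    data.loc[:, (data > 5).any(axis=0)]

    Out:

           a   b   c   d
    one    0   1   2   3
    two    4   5   6   7
    three  8   9  10  11
    four  12  13  14  15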
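
With a threshold of 5 every column passes, so the column filter above does not visibly remove anything. As a rough illustration, the sketch below uses a threshold of 12 (an arbitrary value chosen for this example, not from the original post), where the row and column filters do drop data, and also shows all() for the stricter "every value" condition.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['one', 'two', 'three', 'four'],
                    columns=['a', 'b', 'c', 'd'])

mask = data > 12  # illustrative threshold

# Rows with at least one value above 12
rows_any = data[mask.any(axis=1)]
print(rows_any)

# Columns with at least one value above 12 (column a tops out at 12)
cols_any = data.loc[:, mask.any(axis=0)]
print(cols_any)

# Rows in which every value is greater than 5
rows_all = data[(data > 5).all(axis=1)]
print(rows_all)
```

any() answers "at least one", all() answers "every one"; choosing between them depends on whether a single qualifying value is enough to keep the row or column.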
  • Original article: https://www.cnblogs.com/djlbolgs/p/12507162.html