  • How to iterate over the rows of a pandas DataFrame

    From: https://blog.csdn.net/tanzuozhev/article/details/76713387

    How to iterate over rows in a DataFrame in Pandas: row-by-row DataFrame iteration

    https://stackoverflow.com/questions/16476924/how-to-iterate-over-rows-in-a-dataframe-in-pandas

    http://stackoverflow.com/questions/7837722/what-is-the-most-efficient-way-to-loop-through-dataframes-with-pandas

    When working with a DataFrame, we inevitably need to inspect or process the data row by row. What are the efficient and convenient ways to do this?

    Indexing by position

    import pandas as pd
    inp = [{'c1':10, 'c2':100}, {'c1':11,'c2':110}, {'c1':12,'c2':120}]
    df = pd.DataFrame(inp)
    for x in range(len(df.index)):
        print(df['c1'].iloc[x])

    This is probably the most conventional approach, and it lets you modify the DataFrame while iterating.
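    As a minimal sketch of modifying the DataFrame during positional iteration (written for Python 3; the column name `c3` is a hypothetical addition for illustration, not part of the original post), the loop below writes a derived value back through `df.loc`:

```python
import pandas as pd

inp = [{'c1': 10, 'c2': 100}, {'c1': 11, 'c2': 110}, {'c1': 12, 'c2': 120}]
df = pd.DataFrame(inp)

# 'c3' is a hypothetical column used only for this illustration.
df['c3'] = 0

for x in range(len(df.index)):
    label = df.index[x]
    # Write through the DataFrame itself (not through a row copy),
    # so the change is still visible after the loop.
    df.loc[label, 'c3'] = df['c1'].iloc[x] + df['c2'].iloc[x]

print(df)
```

    Because the assignment targets `df` directly, every row keeps its new `c3` value once the loop finishes.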

    enumerate

    for i, row in enumerate(df.values):
        index = df.index[i]
        print(row)

    df.values is a numpy.ndarray.
    Here i is the position within the index, and each row is also a numpy.ndarray.
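    One consequence worth seeing concretely: a NumPy array has a single dtype, so with columns of different types the rows are upcast to a common type. The sketch below assumes a hypothetical two-column frame (`n` integer, `x` float) that is not in the original post:

```python
import pandas as pd

# Hypothetical mixed-dtype frame: one integer column, one float column.
df = pd.DataFrame({'n': [1, 2], 'x': [0.5, 1.5]})

values = df.values
print(type(values))    # the array type
print(values.dtype)    # one common dtype for all columns

for i, row in enumerate(values):
    index = df.index[i]
    # Each row is a 1-D array that shares the common dtype.
    print(index, row, row.dtype)
```

    If the integer values must stay integers, this is a reason to prefer one of the column-aware approaches below.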

    iterrows

    https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iterrows.html

    import pandas as pd
    inp = [{'c1':10, 'c2':100}, {'c1':11,'c2':110}, {'c1':12,'c2':120}]
    df = pd.DataFrame(inp)
    
    for index, row in df.iterrows():
        print(row['c1'], row['c2'])
    
    # 10 100
    # 11 110
    # 12 120

    Each iteration of df.iterrows() yields a tuple containing the index label and the data of that row.

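    1. With iterrows, each row is a Series, so the dtypes of the DataFrame columns are not preserved across the row.
    2. The returned Series is generally a copy of the row rather than a view of the original DataFrame, so assigning to it should not be relied on to modify the original DataFrame.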
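    Both points can be checked with a short sketch. It assumes a hypothetical mixed-dtype frame (`n` integer, `x` float) and uses `df.at` as one possible way to write back by label; neither appears in the original post:

```python
import pandas as pd

# Hypothetical mixed-dtype frame used only for this illustration.
df = pd.DataFrame({'n': [1, 2], 'x': [0.5, 1.5]})

for index, row in df.iterrows():
    # The row Series has one dtype, so the integer column is upcast.
    print(index, row['n'], type(row['n']))
    # This assignment only changes the temporary row Series.
    row['n'] = 99

after_first = df['n'].tolist()
print(after_first)    # the DataFrame still holds its original values

# To change the DataFrame, write to it directly by label.
for index, row in df.iterrows():
    df.at[index, 'n'] = int(row['n']) * 10

print(df['n'].tolist())
```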

    itertuples

    http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.itertuples.html

    import pandas as pd
    inp = [{'c1':10, 'c2':100}, {'c1':11,'c2':110}, {'c1':12,'c2':120}]
    df = pd.DataFrame(inp)
    
    for row in df.itertuples():
        # Equivalent to: print(row[0], row[1], row[2])
        print(row.Index, row.c1, row.c2)

    Each item yielded by itertuples is a namedtuple whose type is reported as pandas.core.frame.Pandas.

    itertuples is generally considered to be faster than iterrows.
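    The sketch below illustrates the two documented parameters of `itertuples`, `index` and `name`. The frame and the class name `Row` are hypothetical choices for this example:

```python
import pandas as pd

# Hypothetical mixed-dtype frame used only for this illustration.
df = pd.DataFrame({'n': [1, 2], 'x': [0.5, 1.5]})

# Default: the index is the first field and the tuple class is named "Pandas".
for row in df.itertuples():
    print(row.Index, row.n, row.x, type(row.n))

# index=False drops the index field; name sets the namedtuple class name.
# 'Row' is an arbitrary name chosen for this example.
rows = list(df.itertuples(index=False, name='Row'))
print(rows[0])
print(rows[0]._fields)
```

    Unlike the Series returned by iterrows, each field here comes from its own column, so the integer column is not turned into floats.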

    zip / itertools.izip

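    zip and itertools.izip are used in the same way, but in Python 2 zip returns a list while izip returns an iterator, so for large data zip performs worse than izip. In Python 3, itertools.izip no longer exists and the built-in zip already returns an iterator.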

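    # izip exists only in Python 2; on Python 3 the built-in zip is already lazy.
    try:
        from itertools import izip
    except ImportError:
        izip = zip
    import pandas as pd
    inp = [{'c1': 10, 'c2': 100}, {'c1': 11, 'c2': 110}, {'c1': 12, 'c2': 120}]
    df = pd.DataFrame(inp)
    
    for row in izip(df.index, df['c1'], df['c2']):
        print(row)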
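    For readers on Python 3, where the built-in `zip` is lazy, a minimal sketch of the same idea follows. The variable names `pairs` and `rows` are illustrative only:

```python
import pandas as pd

inp = [{'c1': 10, 'c2': 100}, {'c1': 11, 'c2': 110}, {'c1': 12, 'c2': 120}]
df = pd.DataFrame(inp)

pairs = zip(df.index, df['c1'], df['c2'])
# In Python 3 a zip object is its own iterator rather than a list.
print(iter(pairs) is pairs)

# Materialize the rows only when a list is actually needed.
rows = list(pairs)
print(rows)
```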

    Timing comparison

    import time
    import pandas as pd
    from numpy.random import randn
    
    df = pd.DataFrame({'a': randn(100000), 'b': randn(100000)})
    
    time_stat = []
    
    # range(index)
    test_list = []
    t = time.time()
    for r in range(len(df)):
        test_list.append((df.index[r], df.iloc[r,0], df.iloc[r,1]))
    time_stat.append(time.time()-t)
    
    # enumerate
    test_list = []
    t = time.time()
    for i, r in enumerate(df.values):
        test_list.append((df.index[i], r[0], r[1]))
    time_stat.append(time.time()-t)
    
    # iterrows
    test_list = []
    t = time.time()
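    for i, r in df.iterrows():
        # i is already the index label; df.index[i] works here only because the index is the default 0..n-1 range
        test_list.append((df.index[i], r['a'], r['b']))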
    time_stat.append(time.time()-t)
    
    # itertuples
    test_list = []
    t = time.time()
    for ir in df.itertuples():
        test_list.append((ir[0], ir[1], ir[2]))
    time_stat.append(time.time()-t)
    
    # zip
    test_list = []
    t = time.time()
    for r in zip(df.index, df['a'], df['b']):
        test_list.append((r[0], r[1], r[2]))
    time_stat.append(time.time()-t)
    
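    # izip (Python 2 only; on Python 3 fall back to the built-in zip, which is already lazy)
    try:
        from itertools import izip
    except ImportError:
        izip = zip
    test_list = []
    t = time.time()
    for r in izip(df.index, df['a'], df['b']):
        test_list.append((r[0], r[1], r[2]))
    time_stat.append(time.time()-t)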
    
    time_df = pd.DataFrame({'items': ['range(index)', 'enumerate', 'iterrows', 'itertuples', 'zip', 'izip'], 'time': time_stat})
    
    print(time_df.sort_values('time'))
    
    
              items       time
    5          izip   0.034869
    4           zip   0.040440
    3    itertuples   0.072604
    1     enumerate   0.174094
    2      iterrows   4.026293
    0  range(index)  21.921407

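    Ranked by elapsed time (in seconds, for 100,000 rows), the methods go from fastest to slowest as follows: izip, zip, itertuples, enumerate, iterrows, range(index). In other words, the time cost is izip < zip < itertuples < enumerate < iterrows < range(index).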
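    To repeat the comparison on Python 3, one option is the standard `timeit` module. The following is only a sketch under assumptions: the function names are hypothetical, the frame is reduced to 1,000 rows so it runs quickly, and no timings are quoted because they depend on the machine and the pandas version.

```python
import timeit

import pandas as pd
from numpy.random import randn

# Smaller frame than the 100000 rows used above, so the sketch runs quickly.
df = pd.DataFrame({'a': randn(1000), 'b': randn(1000)})


def by_position():
    return [(df.index[r], df.iloc[r, 0], df.iloc[r, 1]) for r in range(len(df))]


def by_enumerate():
    return [(df.index[i], r[0], r[1]) for i, r in enumerate(df.values)]


def by_iterrows():
    return [(i, r['a'], r['b']) for i, r in df.iterrows()]


def by_itertuples():
    return [(ir[0], ir[1], ir[2]) for ir in df.itertuples()]


def by_zip():
    return list(zip(df.index, df['a'], df['b']))


methods = [by_position, by_enumerate, by_iterrows, by_itertuples, by_zip]

# Best of 3 single runs per method, in seconds.
timings = {f.__name__: min(timeit.repeat(f, number=1, repeat=3)) for f in methods}

time_df = pd.DataFrame({'items': list(timings), 'time': list(timings.values())})
print(time_df.sort_values('time'))
```

    Taking the minimum of several runs reduces the influence of unrelated system activity on each measurement.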

  • Original article: https://www.cnblogs.com/bonelee/p/9732761.html