用apply处理pandas比用for循环,快了无数倍,测试如下:
我们有一个pandas加载的dataframe如下,features是0和1特征的组合,可惜都是str形式(字符串形式),我们要将其转换成一个装有整型int 0和1的list
(1)用for循坏(耗时约3小时)
1 from tqdm import tqdm #计时器函数 2 for i in tqdm(range(df.shape[0])): 3 df['features'][i] = df['features'][i].split(",") #每一行形如0,0,1,1,0,1,1的string,所以按照逗号切割,返回一个list 4 for j in range(len(df['features'][i])): #遍历该list,对于每个元素进行int转换 5 df['features'][i][j] = int(df['features'][i][j]) 6 7 print(type(df['features'][0]))
(2)推荐用apply方法(耗时约30秒)
1 from time import time 2 from tqdm import tqdm 3 4 def func(x): 5 l = x.split(",") 6 for i in range(len(l)): 7 l[i] = int(l[i]) 8 return l 9 10 stime = time() 11 df['new_features'] = df['features'].apply(func) 12 endtime = time() 13 14 print("time:"+str(endtime-stime)+"s") 15 #df.head() 16 print("over")