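The sparse matrix in the source data was saved to a text file, so each row is now a line of text. It has to be restored to matrix form before it can be processed.
Source data format: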
[u'0b85922424fb39bb566723aa3d71c', u'7afb428b3dc75', u'1', u'42-44', u'SM-N7506V', SparseVector(5834, {0: 1.0, 9: 1.0, 11: 1.0, 14: 1.0, 29: 1.0, 42: 1.0, 59: 1.0, 180: 1.0, 617: 1.0, 639: 1.0, 1356: 1.0})]
[u'08d242d4f0baba3aa6feb9ad5ea', u'3c02f9965966c117fd3f', u'0', u'33-35', u'P7', SparseVector(5834, {11: 1.0, 45: 1.0, 249: 1.0, 363: 1.0, 405: 1.0, 456: 1.0, 710: 1.0, 802: 1.0, 1053: 1.0, 4340: 1.0})]
[u'cabee1431f8bb3cf5080851835', u'a2d6926a05cc7ff70288', u'1', u'27-29', u'OPPO R9tm', SparseVector(5834, {1: 1.0, 20: 1.0, 30: 1.0, 39: 1.0, 42: 1.0, 54: 1.0, 56: 1.0, 60: 1.0, 108: 1.0, 282: 1.0, 327: 1.0, 408: 1.0, 1795: 1.0, 1907: 1.0, 2287: 1.0})]
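Each line is the printed form of a Python list whose last element is SparseVector(size, {index: value}). The following is a minimal sketch of parsing one such line, assuming every line follows the format shown above. The helper name `parse_line` and the shortened IDs `id1` and `id2` are made up for illustration.

```python
import ast

def parse_line(line):
    """Split one text line into (fields, size, {index: value}).

    parse_line is a hypothetical helper name, not part of the original script.
    """
    # Removing the constructor name leaves a plain literal: [..., (size, {...})]
    record = ast.literal_eval(line.replace("SparseVector", ""))
    fields = list(record[:5])      # the five leading string fields
    size, entries = record[5]      # vector length and the non-zero entries
    return fields, size, entries

# Placeholder IDs; the last three fields follow the first sample line above
sample = ("[u'id1', u'id2', u'1', u'42-44', u'SM-N7506V', "
          "SparseVector(5834, {0: 1.0, 9: 1.0, 1356: 1.0})]")
fields, size, entries = parse_line(sample)
print(fields[4], size, sorted(entries))  # SM-N7506V 5834 [0, 9, 1356]
```

`ast.literal_eval` accepts only literals, so it is a safer choice than `eval` when the text file comes from somewhere else.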
Processing steps:
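import ast

out = []
# Only the first three lines of the sample file are read here
with open("installed_applist_sample", 'r') as f_train:
    for line in f_train.readlines()[0:3]:
        # Drop the constructor name so the line becomes a plain literal:
        # [..., (5834, {0: 1.0, ...})]
        line = line.replace('SparseVector', '')
        samp = ast.literal_eval(line)   # safer than eval for text input
        out1 = list(samp[0:5])          # the five leading string fields
        mat = samp[5]                   # (vector length, {index: value})
        lenvec = mat[0]
        dic1 = mat[1]
        for i in range(lenvec):
            # Test membership on the dict itself; searching a key list is slow
            if i in dic1:
                out1.append(1)
            else:
                out1.append(0)
        out.append(out1)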
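The loop above visits every position of the vector, although only a handful of them are non-zero. As an alternative sketch, assuming the same (length, dict) pair, the row can be pre-filled with zeros and only the stored indices written. The function name `to_dense` is hypothetical.

```python
def to_dense(size, entries):
    """Expand a (size, {index: value}) pair into a 0/1 list of length size."""
    dense = [0] * size
    for idx in entries:
        if not 0 <= idx < size:
            raise IndexError("index %d outside vector of length %d" % (idx, size))
        dense[idx] = 1   # like the script above, store 1 regardless of the value
    return dense

row = to_dense(12, {0: 1.0, 9: 1.0, 11: 1.0})
print(row)  # [1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1]
```

The work then scales with the number of non-zero entries rather than with the vector length, which matters once all lines of the file are processed instead of the first three.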
Restored data format:
[[u'85922424fb39bb566723aa3d71c', u'e104d981a7afb428b3dc75', u'1', u'42-44', u'SM-N7506V', 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, ...],[u'08d242d4f0baba3aa6feb9ad5ea', u'33c02f9965966c117fd3f', u'0', u'33-35', u'P7', 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...] ...]
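One way to check a restored row is to read the non-zero positions back and compare them with the indices in the source line. This is a small sketch under the assumption that each row starts with five string fields. `nonzero_positions` is a hypothetical helper, and the row below is a shortened made-up example rather than real data.

```python
def nonzero_positions(row, n_fields=5):
    """Return the vector indices that hold a non-zero value in a restored row."""
    return [i for i, v in enumerate(row[n_fields:]) if v != 0]

# Five placeholder fields followed by a 12-element dense vector
restored = ['id1', 'id2', '1', '42-44', 'SM-N7506V',
            1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1]
print(nonzero_positions(restored))  # [0, 9, 11]
```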