  • Data mining: cleaning a CSV file that contains Chinese characters with Python

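      Data cleaning: this post uses Python to clean a CSV file that contains Chinese characters. The idea is to map the Chinese strings through a dictionary, where each key is a Chinese string and its value is an auto-incrementing index. Because a dictionary cannot hold duplicate keys, every distinct Chinese string is mapped to exactly one numeric index.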
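
The mapping idea described above can be sketched on a small list. This is only an illustration: the function name `encode` and the sample values are invented here and do not come from the original data. In this sketch the next code is simply the current size of the dictionary, so the codes are consecutive.

```python
def encode(values):
    """Map each distinct string to an auto-incrementing integer code."""
    mapping = {}
    codes = []
    for value in values:
        # A dictionary holds each key once, so a repeated string
        # always gets the code assigned at its first occurrence.
        if value not in mapping:
            mapping[value] = len(mapping)
        codes.append(mapping[value])
    return codes, mapping


codes, mapping = encode(["北京", "上海", "北京", "广州"])
print(codes)    # [0, 1, 0, 2]
print(mapping)  # {'北京': 0, '上海': 1, '广州': 2}
```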

      The Python code is as follows (the data is in CSV format):

    import csv

    # 0-based indices of the columns that hold Chinese text
    # (spreadsheet columns C, E, Z, AA, AB, AL, AM, AO, AP, AQ, AT, AX).
    TEXT_COLUMNS = [2, 4, 25, 26, 27, 37, 38, 40, 41, 42, 45, 49]

    # One dictionary per column: key = original string, value = numeric code.
    mappings = {col: {} for col in TEXT_COLUMNS}

    # The "with" statement closes both files automatically,
    # so no explicit close() call is needed.
    with open('E:/test/real/b.csv', 'r', newline='') as csv_file_read, \
            open('E:/test/real/test.csv', 'w', newline='') as csv_file_write:
        reader = csv.reader(csv_file_read)
        writer = csv.writer(csv_file_write)

        # Copy the header row unchanged.
        header = next(reader, None)
        if header is not None:
            writer.writerow(header)

        # index is the 0-based number of the data row. The row in which a
        # string first appears supplies its code, so codes are unique within
        # a column but not necessarily consecutive.
        for index, row in enumerate(reader):
            for col in TEXT_COLUMNS:
                row[col] = mappings[col].setdefault(row[col], index)
            writer.writerow(row)

    print('done!')
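
The script above uses the row number as the code, so the codes within one column can have gaps. If consecutive codes per column are preferred, a variation might look like the sketch below. The function name `encode_columns`, its parameters, and the sample rows are hypothetical; it works on any `csv` reader and writer rather than on the fixed paths, which also makes it easy to try in memory.

```python
import csv
import io


def encode_columns(reader, writer, columns):
    """Copy rows from reader to writer, replacing the strings in `columns`
    with consecutive per-column integer codes. Returns the mappings."""
    mappings = {col: {} for col in columns}
    header = next(reader, None)
    if header is None:  # empty input: nothing to copy
        return mappings
    writer.writerow(header)
    for row in reader:
        for col in columns:
            mapping = mappings[col]
            # A new string receives the next unused code for its column.
            row[col] = mapping.setdefault(row[col], len(mapping))
        writer.writerow(row)
    return mappings


# In-memory demonstration with invented sample data.
source = io.StringIO("id,city\n1,北京\n2,上海\n3,北京\n", newline="")
target = io.StringIO(newline="")
mappings = encode_columns(csv.reader(source), csv.writer(target), [1])
print(mappings[1])  # {'北京': 0, '上海': 1}
```

Returning the mappings makes it possible to save them and translate the codes back to the original strings later.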



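      The example above comes from a real data-processing job: the raw data has two hundred attribute columns and thirty thousand rows. It contains Chinese characters as well as missing values, so it has to be cleaned step by step.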
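
The post does not show how the missing values are handled. One possible step, sketched here under the assumption that a missing value appears as an empty (or whitespace-only) cell, is to leave such cells untouched so that "missing" does not become a category code of its own. The name `encode_cell` and the sample row are invented for illustration.

```python
MISSING = ""  # assumption: a missing value shows up as an empty cell


def encode_cell(value, mapping):
    """Return the code for value, leaving missing cells as they are."""
    if value.strip() == MISSING:
        return MISSING
    # A new string receives the next unused code.
    return mapping.setdefault(value, len(mapping))


mapping = {}
row = ["男", "", "女", "男"]
encoded = [encode_cell(cell, mapping) for cell in row]
print(encoded)  # [0, '', 1, 0]
```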

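      Note: a "permission denied" exception can occur (in Python 3, PermissionError: [Errno 13] Permission denied).

      Solution: the CSV file is currently open in another program, so Python is not allowed to open it in w (write) mode. Close the file and run the script again.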
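
If the script should report this situation instead of stopping with a traceback, the exception can be caught when the output file is opened. The following is a minimal sketch; the function name `write_rows`, the message text, and the demo path in the temporary directory are assumptions made for illustration.

```python
import csv
import os
import tempfile


def write_rows(path, rows):
    """Write rows to path. Return False instead of crashing when the
    file cannot be opened for writing (for example, because another
    program has it open)."""
    try:
        with open(path, "w", newline="") as f:
            csv.writer(f).writerows(rows)
    except PermissionError as exc:
        print(f"Cannot write {path}: {exc}. "
              "Close the file in other programs and try again.")
        return False
    return True


# Demonstration with a throwaway file in the system temp directory.
demo_path = os.path.join(tempfile.gettempdir(), "permission_demo.csv")
ok = write_rows(demo_path, [["id", "city"], ["1", "0"]])
print(ok)
```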

  • Original article: https://www.cnblogs.com/rongyux/p/5404917.html