zoukankan      html  css  js  c++  java
  • python 正则表达式匹配中文

    #!/usr/bin/python
    #
    -*- coding:cp936-*-

    #思路,将str转换成unicode,方可用正则表达式,前提是,要知道文件的编码,本例中是gbk
    import cPickle as mypickle
    import re
    import sys
    if (__name__=='__main__'):
        fid1=file('above50purenames.txt','r');
        p=re.compile('(^\s+|\s+$)');
        phanzigbk=re.compile('[\\x20-\\x7f]');
        phanzi=re.compile(u'[\u4e00-\u9fa5]');#这里要加u,注意
        commlines=fid1.readlines();
        fid1.close();
        dictfamilyname={};
        dictfirstname={};
        for line in commlines:
            line=p.sub('',line);
            print type(line);
            print line;
            uline=unicode(line,'gbk');
            print type(uline);
            candidates=phanzi.findall(uline);

            print len(candidates);
            if(len(candidates)==2):
                print candidates[0];
                familynamegbk=candidates[0].encode('gbk');#把unicode型的变量变成str型的变量
                firstnamegbk=candidates[1].encode('gbk');
                if(dictfamilyname.has_key(familynamegbk)):
                    dictfamilyname[familynamegbk]=dictfamilyname[familynamegbk]+1;
                else:
                    dictfamilyname[familynamegbk]=1;
            
                if(dictfirstname.has_key(firstnamegbk)):
                    dictfirstname[firstnamegbk]=dictfirstname[firstnamegbk]+1;
                else:
                    dictfirstname[firstnamegbk]=1;

        familynameitems=dictfamilyname.items();
        print familynameitems;
        firstnameitems=dictfirstname.items();
        familynameitems.sort(key=lambda d:d[1],reverse=True);
        firstnameitems.sort(key=lambda d :d[1],reverse=True);
        fid=file('familyname.txt','w');
        for m in familynameitems:
            s=m[0]+'\t'+str(m[1]);
            fid.write(s);
            fid.write('\n');
        fid.close();
        fid=file('firstname.txt','w');
        for m in firstnameitems:
            s=m[0]+'\t'+str(m[1]);
            fid.write(s);
            fid.write('\n');
        fid.close();
        print 'finish'
       

  • 相关阅读:
    mybatis-generator自动生成dao,mapping,model
    cent os 6.5+ambari+HDP集群安装
    cent os 6.5 配置vsftpd
    Ambari修改主页面方法
    Maven .m2 epositoryjdk ools1.7 missing
    Chrome封掉不在chrome商店中的插件解决办法
    Hadoop 读取文件API报错
    Hadoop创建/删除文件夹出错
    mysql性能测试(索引)
    Greys--JVM异常诊断工具
  • 原文地址:https://www.cnblogs.com/finallyliuyu/p/2340685.html
Copyright © 2011-2022 走看看