  • Python word segmentation

    #encoding=utf-8
    import jieba
    
    # Full mode: cut_all=True
    seg_list = jieba.cut("明天不上班啊", cut_all=True)
    print("Full Mode:", "/ ".join(seg_list))
    
    # Default mode: cut_all=False
    seg_list = jieba.cut("明天不上班啊", cut_all=False)
    print("Default Mode:", "/ ".join(seg_list))
    
    # cut_all defaults to False, so this matches the previous call
    seg_list = jieba.cut("明天不上班啊")
    print(", ".join(seg_list))

    Output:

    F:\python-study\fenci>python test.py
    Building prefix dict from C:\Python33\lib\site-packages\jieba\dict.txt ...
    Loading model from cache c:\users\zhaoji~1\appdata\local\temp\jieba.cache
    Loading model cost 0.840 seconds.
    Prefix dict has been built succesfully.
    Full Mode: 明天/ 不/ 上班/ 啊
    Default Mode: 明天/ 不/ 上班/ 啊
    明天, 不, 上班, 啊

    Python word segmentation tool: jieba

    1. Error when running the script:

    F:\python-study\fenci>python test.py
    File "test.py", line 3
    SyntaxError: Non-UTF-8 code starting with '\xce' in file test.py on line 3, but
    no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

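    After looking it up, the cause turned out to be the encoding the editor saved the file in: opening test.py in Notepad, the status bar at the bottom shows ANSI. Converting the file to UTF-8 fixes the error.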
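The same conversion can also be done from Python instead of the editor. The following is only a sketch: `convert_to_utf8` is a hypothetical helper name, and it assumes that "ANSI" on this machine means the Simplified Chinese Windows code page (GBK / cp936), which the post does not state. Pass a different `source_encoding` if that assumption does not hold.

```python
import tempfile
from pathlib import Path


def convert_to_utf8(path, source_encoding="gbk"):
    """Re-save a text file as UTF-8 (hypothetical helper).

    source_encoding="gbk" is an assumption about what "ANSI" means here.
    """
    file_path = Path(path)
    # Decode the raw bytes with the old encoding...
    text = file_path.read_bytes().decode(source_encoding)
    # ...and write them back as UTF-8 (bytes I/O keeps line endings untouched).
    file_path.write_bytes(text.encode("utf-8"))
    return text


# Demo on a throwaway file so nothing real gets overwritten.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_file = Path(tmp_dir) / "test.py"
    source = 'print("明天不上班啊")\n'
    demo_file.write_bytes(source.encode("gbk"))  # simulate an ANSI-saved script

    convert_to_utf8(demo_file)
    # The file now decodes cleanly as UTF-8 and the content is unchanged.
    print(demo_file.read_bytes().decode("utf-8") == source)
```

The conversion overwrites the file in place, so it is worth keeping a copy of the original when running it on a real script.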

    2. In Python 3, print is a function, so its arguments must be wrapped in parentheses:

    print()
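To make the difference concrete, here is a small sketch that checks whether a line of source code compiles under Python 3. `is_valid_python3` is a hypothetical helper written for this illustration; it relies only on the built-in `compile`.

```python
def is_valid_python3(source):
    """Return True if the source compiles under the running Python 3 interpreter."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False


# Python 2 statement form: rejected by Python 3.
print(is_valid_python3('print "Full Mode:"'))    # False
# Python 3 function-call form: accepted.
print(is_valid_python3('print("Full Mode:")'))   # True

# print() accepts several arguments and joins them with a space by default.
print("Full Mode:", "/ ".join(["a", "b", "c"]))  # Full Mode: a/ b/ c
```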

    Test:

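    #coding=utf-8
    import jieba
    import jieba.posseg as pseg
    
    # Read the input text (assumes in.txt is saved as UTF-8)
    with open("in.txt", "r", encoding="utf-8") as f:
        string = f.read()
    
    words = pseg.cut(string)  # segment the text and tag parts of speech
    result = ""
    for w in words:
        result += str(w.word) + "/" + str(w.flag)  # append word/POS tag
    
    # Write the tagged text; the with block closes the file
    with open("out.txt", "w", encoding="utf-8") as f:
        f.write(result)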
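The script above appends every `word/flag` pair directly to the previous one, so the tokens run together in out.txt. A possible variation, sketched below, joins the pairs with a separator. `format_tagged` and `tag_text` are hypothetical helper names, and the tags in the sample list are hand-written for illustration, not real jieba output. jieba is imported lazily so that the formatting helper can be tried without the library installed.

```python
def format_tagged(pairs, sep=" "):
    """Join (word, flag) pairs as 'word/flag' strings separated by sep."""
    return sep.join("{}/{}".format(word, flag) for word, flag in pairs)


def tag_text(text):
    """Segment and tag text with jieba, then format it (requires jieba)."""
    import jieba.posseg as pseg  # lazy import: only needed for real tagging

    # Each item yielded by pseg.cut unpacks into (word, flag).
    return format_tagged(pseg.cut(text))


# Hand-written sample pairs; the tags are illustrative assumptions.
sample = [("明天", "t"), ("不", "d"), ("上班", "v")]
print(format_tagged(sample))  # 明天/t 不/d 上班/v
```

With jieba installed, `tag_text(string)` could replace the loop in the script, and the result written to out.txt as before.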
  • Original article: https://www.cnblogs.com/huanhuanang/p/4750343.html