  • Word-frequency counting with the jieba segmentation package on Python 3.5

    At work I sometimes need to split text into words and analyze word frequencies. I put together a simple version using the jieba segmenter; the code is as follows:

    import pandas as pd  ## pandas for tabular data
    import jieba.posseg as pseg  ## jieba segmentation with part-of-speech tagging
    
    path = ''  ## path to the input file
    data1 = pd.read_csv(path, sep='\t')  ## set sep to the file's delimiter; '\t' for tab-separated
    rows = []
    for i in range(len(data1)):
        words = pseg.cut(data1.iloc[i, x])  ## x = zero-based index of the column to segment
        for t in words:
            rows.append({'word': t.word, 'type': t.flag})  ## each word with its POS tag
    df1 = pd.DataFrame(rows, columns=['word', 'type'])
    df3 = df1.groupby(['word', 'type']).size()  ## frequency of each (word, POS) pair
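If pandas is not required, the same tally can be done with `collections.Counter` (which the original code imports but never uses). A minimal sketch, with the token stream hardcoded as (word, flag) pairs standing in for `jieba.posseg.cut` output so it runs without jieba installed:

```python
from collections import Counter

# Hardcoded stand-in for the (word, flag) pairs that jieba.posseg.cut
# would yield; on real input, replace with pseg.cut(text).
tokens = [("结巴", "n"), ("分词", "n"), ("分词", "n"), ("词频", "n"), ("统计", "v")]

# Count each (word, POS-tag) pair, mirroring the pandas groupby.
freq = Counter(tokens)
print(freq[("分词", "n")])  # → 2
```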
    

      
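One pitfall in the final step: calling `groupby(...).count()` on a frame whose only columns are the grouping keys leaves no columns to count, so the result is empty. `size()` returns the number of rows per group directly. A small sketch on a hypothetical two-column table of the same shape:

```python
import pandas as pd

# Tiny stand-in for the (word, type) table the loop builds.
df1 = pd.DataFrame({'word': ['分词', '分词', '统计'],
                    'type': ['n', 'n', 'v']})

# size() yields one count per (word, type) group; count() here would
# return a frame with zero columns, since both columns are group keys.
freq = df1.groupby(['word', 'type']).size()
print(freq[('分词', 'n')])  # → 2
```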

  • Original post: https://www.cnblogs.com/helloxia/p/6374198.html