  • The jieba Python library: trying it out, part 2

    1 Word frequency counting and descending sort

    # encoding=utf-8
    import jieba
    
    # read the file to analyse
    article = open("freq_wrod_test.txt", "r", encoding="utf-8").read()
    words = jieba.cut(article, cut_all=False)
    # count word frequencies
    word_freq = {}
    for word in words:
        if word in word_freq:
            word_freq[word] += 1
        else:
            word_freq[word] = 1
    # sort by frequency, descending
    freq_word = []
    for word, freq in word_freq.items():
        freq_word.append((word, freq))
    freq_word.sort(key=lambda x: x[1], reverse=True)
    max_number = int(input("需要前多少位高频词? "))  # Python 3: input() replaces raw_input()
    # display
    for word, freq in freq_word[:max_number]:
        print(word, freq)
    

    Result

    需要前多少位高频词? 10
    , 59
    的 53
    。 46
    - 33
      26
    编程 16
    : 14
    python 13
    了 12
    Python 12

    One problem stands out: lots of useless tokens, such as punctuation marks, are being counted too.
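As a quick alternative to a stopword file, keeping only tokens that contain at least one word character already filters most punctuation, and `collections.Counter` replaces the manual count-and-sort loop. A minimal sketch (the regex filter is an addition, not part of the original approach; the token list stands in for what `jieba.cut` would yield on a hypothetical text):

```python
import re
from collections import Counter

# tokens as jieba.cut might yield them (hypothetical sample)
words = ["编程", ",", "很", "有趣", "。", "编程", " ", "python", ",", "编程"]

# keep only tokens containing at least one word character
word_freq = Counter(w for w in words if re.search(r"\w", w))

# most_common already returns (word, freq) pairs in descending order
top = word_freq.most_common(2)
print(top)  # → [('编程', 3), ('很', 1)]
```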

    2 Manually removing stopwords

    # encoding=utf-8
    import jieba
    
    # load stopwords, one per line
    stopwords = [line.strip() for line in open('stopwords.txt', 'r', encoding='utf-8')]
    # extra stopwords: empty string and whitespace variants
    stopwords.extend(['', ' ', '\n', '\u3000'])
    # read the file to analyse
    article = open("freq_wrod_test.txt", "r", encoding="utf-8").read()
    words = jieba.cut(article, cut_all=False)
    # count word frequencies, skipping stopwords
    word_freq = {}
    for word in words:
        if word in stopwords:
            continue
        if word in word_freq:
            word_freq[word] += 1
        else:
            word_freq[word] = 1
    # sort by frequency, descending
    freq_word = []
    for word, freq in word_freq.items():
        freq_word.append((word, freq))
    freq_word.sort(key=lambda x: x[1], reverse=True)
    max_number = int(input("需要前多少位高频词? "))
    # display
    for word, freq in freq_word[:max_number]:
        print(word, freq)

    Result:

    需要前多少位高频词? 10
    编程 16
    python 13
    Python 12
    教程 9
    写 9
    项目 7
    代码 7
    程序 6
    经验 5
    找 5

    Notice that python and Python, two spellings of the same word, were counted separately. How can synonyms be merged into one count?
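For this particular duplicate, lower-casing each token before counting merges ASCII case variants without needing a synonym dictionary; Chinese tokens are unaffected by `lower()`. A minimal sketch with a hypothetical token list:

```python
from collections import Counter

# hypothetical tokens from jieba.cut
words = ["Python", "python", "编程", "Python", "编程"]

# normalize case before counting so Python/python collapse into one key
word_freq = Counter(w.lower() for w in words)
print(word_freq["python"])  # → 3
```

This only handles case variants; true synonyms (different words, same meaning) still need the dictionary approach described next.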

    3 Merging synonyms

     The rough idea: first build a synonym dictionary file, with the words on each line separated by tabs; every synonym on a line is translated to the meaning of the line's first word.

    1 Read the synonym dictionary line by line; store each word on a line as a dict key, with the line's first word as its value.

    2 While counting, check whether the current word is in the dict; if so, replace it with the mapped word.
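For illustration, assuming a tongyici.txt whose lines look like the hypothetical strings below (one synonym group per line, tab-separated), the two steps above produce a mapping from every word on a line to that line's first word:

```python
# hypothetical contents of tongyici.txt, one synonym group per line
lines = ["python\tPython\tpy", "程序\t代码"]

# step 1: map every word after the first to the line's first word
combine_dict = {}
for line in lines:
    seperate_word = line.strip().split("\t")
    for w in seperate_word[1:]:
        combine_dict[w] = seperate_word[0]

print(combine_dict)  # → {'Python': 'python', 'py': 'python', '代码': '程序'}

# step 2: a word found in the dict is replaced by its canonical form
word = combine_dict.get("Python", "Python")
print(word)  # → python
```

Note that the first word of each line is not stored as a key, so it passes through the lookup unchanged.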

    # encoding=utf-8
    import jieba
    
    # load stopwords, one per line
    stopwords = [line.strip() for line in open('stopwords.txt', 'r', encoding='utf-8')]
    # extra stopwords: empty string and whitespace variants
    stopwords.extend(['', ' ', '\n', '\u3000'])
    # build the synonym dictionary: map every word on a line to the line's first word
    combine_dict = {}
    for line in open("tongyici.txt", "r", encoding="utf-8"):
        seperate_word = line.strip().split("\t")
        num = len(seperate_word)
        for i in range(1, num):
            combine_dict[seperate_word[i]] = seperate_word[0]
    
    # read the file to analyse
    article = open("freq_wrod_test.txt", "r", encoding="utf-8").read()
    words = jieba.cut(article, cut_all=False)
    # count word frequencies, skipping stopwords and merging synonyms
    word_freq = {}
    for word in words:
        if word in stopwords:
            continue
        if word in combine_dict:
            word = combine_dict[word]
        if word in word_freq:
            word_freq[word] += 1
        else:
            word_freq[word] = 1
    # sort by frequency, descending
    freq_word = []
    for word, freq in word_freq.items():
        freq_word.append((word, freq))
    freq_word.sort(key=lambda x: x[1], reverse=True)
    max_number = int(input("需要前多少位高频词? "))
    # display
    for word, freq in freq_word[:max_number]:
        print(word, freq)
    

     Result:

    (The synonym dictionary and the merged counts were shown as images in the original post and are not preserved here.)

  • Original post: https://www.cnblogs.com/xzlq/p/8416581.html