  • Code for the overall NLP pipeline

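    import nltk
    import numpy as np
    import re
    from nltk.corpus import stopwords

    # The tokenizer, stop-word list and POS tagger need NLTK data packages, fetched once
    # with nltk.download(...): "punkt", "stopwords" and "averaged_perceptron_tagger"
    # (newer NLTK releases name them "punkt_tab" and "averaged_perceptron_tagger_eng").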
    
    # 1. Tokenization
    # Adjacent string literals inside parentheses are concatenated, so the text can span lines.
    text = ("Sentiment analysis is a challenging subject in machine learning. "
            "People express their emotions in language that is often obscured by sarcasm, "
            "ambiguity, and plays on words, all of which could be very misleading for "
            "both humans and computers. There's another Kaggle competition for movie review "
            "sentiment analysis. In this tutorial we explore how Word2Vec can be applied to "
            "a similar problem.").lower()
    
    text_list = nltk.word_tokenize(text)
    
    # 2. Remove punctuation and stop words
    # Remove punctuation
    english_punctuations = [',', '.', ':', ';', '?', '(', ')', '[', ']', '&', '!', '*', '@', '#', '$', '%']
    text_list = [word for word in text_list if word not in english_punctuations]
    # Remove stop words
    stops = set(stopwords.words("english"))
    text_list = [word for word in text_list if word not in stops]
    
    # 3. Count word frequencies
    freq_dist = nltk.FreqDist(text_list)
    # Each row is a [word, count] pair; NumPy stores both columns as strings.
    freqArr = np.array(list(freq_dist.items()))
    print(freqArr)
    
    # 4. Part-of-speech tagging
    print(nltk.pos_tag(text_list))
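
Steps 1 to 3 above can also be sketched with the standard library alone, which may help when the NLTK data packages are not available. This is only a rough approximation, not the NLTK behavior: the regex tokenizer drops punctuation instead of splitting it into separate tokens, and `DEMO_STOPWORDS`, `tokenize` and `word_frequencies` are hypothetical names introduced here for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word set for illustration only;
# NLTK's stopwords.words("english") is much larger.
DEMO_STOPWORDS = {"a", "is", "in", "the", "and", "of", "to", "that", "by", "for", "on"}


def tokenize(text):
    """Lowercase the text and keep runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_frequencies(text, stopwords=DEMO_STOPWORDS):
    """Count the tokens that remain after stop-word removal."""
    tokens = [tok for tok in tokenize(text) if tok not in stopwords]
    return Counter(tokens)


sample = "Sentiment analysis is a challenging subject in machine learning."
freqs = word_frequencies(sample)
print(freqs.most_common(3))
```

`Counter.most_common(n)` returns the `n` most frequent `(word, count)` pairs; `nltk.FreqDist` subclasses `Counter`, so the same call works on `freq_dist` in the listing above.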
    

      

  • Original article: https://www.cnblogs.com/elpsycongroo/p/9369420.html