  • Text vectorization

    from sklearn.feature_extraction.text import CountVectorizer
    from nltk.corpus import stopwords   # needs a one-time nltk.download('stopwords')
    
    stop_list = list(set(stopwords.words('english')))  # set() removes duplicate entries
    
    corpus = ['This is the first document.',    # the corpus
              'This is the second second document.',
              'And the third one.',
              'Is this the first document?']
    
    # -----------------------------------
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(corpus)    # vectorize: builds the bag-of-words count matrix
    
    print(X.toarray())
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    print(vectorizer.get_feature_names_out().tolist())
    
    print()
    # -----------------------------------
    
    bigram_vectorizer = CountVectorizer(ngram_range=(1,3),  # n-gram features: unigrams through trigrams
                                        stop_words = stop_list) # stop words to drop
    X = bigram_vectorizer.fit_transform(corpus)
    
    print(X.toarray())
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    print(bigram_vectorizer.get_feature_names_out().tolist())
    
    print()
    # ------------------------------------
    
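    # The analyzer lowercases and tokenizes; one-character tokens such as 'a' are dropped
    analyze = vectorizer.build_analyzer()
    print(analyze('This is a text document to analyze.'))
    
    # Words that were not seen during fit are ignored by transform()
    print(vectorizer.transform(['something completely new.',
                                'and this has something old.']).toarray())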
    
    Output:
    
    [[0 1 1 1 0 0 1 0 1]
     [0 1 0 1 0 2 1 0 1]
     [1 0 0 0 1 0 1 1 0]
     [0 1 1 1 0 0 1 0 1]]
    ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
    
    [[1 1 1 0 0 0 0 0 0 0]
     [1 0 0 0 2 1 1 1 0 0]
     [0 0 0 1 0 0 0 0 1 1]
     [1 1 1 0 0 0 0 0 0 0]]
    ['document', 'first', 'first document', 'one', 'second', 'second document', 'second second', 'second second document', 'third', 'third one']
    
    ['this', 'is', 'text', 'document', 'to', 'analyze']
    [[0 0 0 0 0 0 0 0 0]
     [1 0 0 0 0 0 0 0 1]]
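
To make the output above easier to read, here is a rough pure-Python sketch of what a count vectorizer does: lowercase, tokenize, drop stop words, build n-grams, then count against a sorted vocabulary. It is not scikit-learn's implementation. The names `analyze`, `count_vectorize` and `small_stop_list` are made up for illustration, and the tiny hand-picked stop list stands in for the NLTK list only for this corpus.

```python
import re
from collections import Counter

# Assumption: same shape as CountVectorizer's default token pattern,
# which keeps only tokens of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def analyze(text, ngram_range=(1, 1), stop_words=()):
    """Lowercase, tokenize, drop stop words, then emit word n-grams."""
    tokens = [t for t in TOKEN_PATTERN.findall(text.lower())
              if t not in stop_words]
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def count_vectorize(corpus, **kwargs):
    """Return (sorted vocabulary, count matrix as a list of rows)."""
    analyzed = [analyze(doc, **kwargs) for doc in corpus]
    vocabulary = sorted({g for grams in analyzed for g in grams})
    index = {term: j for j, term in enumerate(vocabulary)}
    matrix = []
    for grams in analyzed:
        row = [0] * len(vocabulary)
        for term, count in Counter(grams).items():
            row[index[term]] = count
        matrix.append(row)
    return vocabulary, matrix


corpus = ['This is the first document.',
          'This is the second second document.',
          'And the third one.',
          'Is this the first document?']

# Plain unigram counts, as in the first vectorizer above.
vocab, counts = count_vectorize(corpus)
print(vocab)
print(counts)

# Hypothetical mini stop list (illustration only, not the NLTK list).
small_stop_list = {'this', 'is', 'the', 'and'}
vocab3, counts3 = count_vectorize(corpus, ngram_range=(1, 3),
                                  stop_words=small_stop_list)
print(vocab3)
print(counts3)
```

Stop words are removed before the n-grams are built, which is why 'third one' appears as a feature even though the original sentence reads "And the third one."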
    

    Parameters of CountVectorizer and TfidfVectorizer: https://blog.csdn.net/du_qi/article/details/51564303
    stopwords: https://www.cnblogs.com/webRobot/p/6079919.html
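
The first link also covers TfidfVectorizer, which this post does not run. As a rough illustration of how tf-idf reweights the count matrix printed earlier, the sketch below applies the formula scikit-learn documents for its default settings (smoothed idf, then L2 normalization of each row). It is a hand-written approximation under that assumption, not the library code, and the function name `tfidf` is hypothetical.

```python
import math


def tfidf(counts):
    """Reweight a count matrix with smoothed idf and L2-normalize each row.

    Assumes the defaults documented for scikit-learn's TfidfTransformer:
    idf(t) = ln((1 + n_docs) / (1 + df(t))) + 1, norm='l2'.
    """
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: number of documents in which each term occurs.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    result = []
    for row in counts:
        weighted = [tf * w for tf, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        result.append([v / norm if norm else 0.0 for v in weighted])
    return result


# Count matrix from the first CountVectorizer output in this post.
# Columns: and, document, first, is, one, second, the, third, this
counts = [[0, 1, 1, 1, 0, 0, 1, 0, 1],
          [0, 1, 0, 1, 0, 2, 1, 0, 1],
          [1, 0, 0, 0, 1, 0, 1, 1, 0],
          [0, 1, 1, 1, 0, 0, 1, 0, 1]]

weights = tfidf(counts)
for row in weights:
    print([round(v, 3) for v in row])
```

In the first row, 'first' (found in two documents) ends up with a larger weight than 'the' (found in all four), even though both occur once: terms that appear in fewer documents are weighted up.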

  • Original post: https://www.cnblogs.com/holaworld/p/12510477.html