  • Bag-of-words model: the test data must use the same vocabulary as the training data

    from sklearn.feature_extraction.text import CountVectorizer

    def get_features_by_wordbag():
        global max_features
        x_train, x_test, y_train, y_test = load_all_files()

        # Fit the vectorizer on the training data only.
        vectorizer = CountVectorizer(
                                     decode_error='ignore',
                                     strip_accents='ascii',
                                     max_features=max_features,
                                     stop_words='english',
                                     max_df=1.0,
                                     min_df=1)
        print(vectorizer)
        x_train = vectorizer.fit_transform(x_train)
        x_train = x_train.toarray()
        vocabulary = vectorizer.vocabulary_

        # Reuse the training vocabulary so test features share the same columns.
        vectorizer = CountVectorizer(
                                     decode_error='ignore',
                                     strip_accents='ascii',
                                     vocabulary=vocabulary,
                                     stop_words='english',
                                     max_df=1.0,
                                     min_df=1)
        print(vectorizer)
        x_test = vectorizer.transform(x_test)
        x_test = x_test.toarray()

        return x_train, x_test, y_train, y_test
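
    A minimal sketch of the idea above, using toy documents in place of load_all_files() (which is assumed to return raw text lists): fitting on the training data and reusing its learned vocabulary keeps the train and test feature matrices column-aligned.

    ```python
    from sklearn.feature_extraction.text import CountVectorizer

    # Hypothetical toy data standing in for load_all_files().
    train_docs = ["spam spam eggs", "eggs ham"]
    test_docs = ["ham spam", "totally unseen words"]

    # Fit on training data only, then capture the learned vocabulary.
    trainer = CountVectorizer()
    x_train = trainer.fit_transform(train_docs).toarray()

    # A second vectorizer built from that vocabulary maps the test data
    # onto exactly the same columns.
    tester = CountVectorizer(vocabulary=trainer.vocabulary_)
    x_test = tester.transform(test_docs).toarray()

    print(x_train.shape[1] == x_test.shape[1])  # same number of feature columns
    ```

    Note that words in the test set that never appeared in training simply vanish: the second test document above maps to an all-zero row.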

     Bag-of-words example:

    >>> vectorizer = CountVectorizer()
    >>> corpus = [
    ...     'This is the first document.',
    ...     'This is the second second document.',
    ...     'And the third one.',
    ...     'Is this the first document?',
    ... ]
    >>> X = vectorizer.fit_transform(corpus)
    >>> X                              
    <4x9 sparse matrix of type '<... 'numpy.int64'>'
        with 19 stored elements in Compressed Sparse ... format>
    

    The default configuration tokenizes the string by extracting words of at least 2 letters. The specific function that does this step can be requested explicitly:

    >>> analyze = vectorizer.build_analyzer()
    >>> analyze("This is a text document to analyze.") == (
    ...     ['this', 'is', 'text', 'document', 'to', 'analyze'])
    True
    

    Each term found by the analyzer during the fit is assigned a unique integer index corresponding to a column in the resulting matrix. This interpretation of the columns can be retrieved as follows:

    >>> vectorizer.get_feature_names() == (
    ...     ['and', 'document', 'first', 'is', 'one',
    ...      'second', 'the', 'third', 'this'])
    True
    
    >>> X.toarray()           
    array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
           [0, 1, 0, 1, 0, 2, 1, 0, 1],
           [1, 0, 0, 0, 1, 0, 1, 1, 0],
           [0, 1, 1, 1, 0, 0, 1, 0, 1]]...)
    
  • Original article: https://www.cnblogs.com/bonelee/p/8426002.html