  • Bag-of-words model: the test data uses the same vocabulary as the training data

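    from sklearn.feature_extraction.text import CountVectorizer

    def get_features_by_wordbag():
        # max_features and load_all_files() are defined elsewhere in the original script
        global max_features
        x_train, x_test, y_train, y_test = load_all_files()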
    
        vectorizer = CountVectorizer(
                                     decode_error='ignore',
                                     strip_accents='ascii',
                                     max_features=max_features,
                                     stop_words='english',
                                     max_df=1.0,
                                     min_df=1 )
        print(vectorizer)
        x_train=vectorizer.fit_transform(x_train)
        x_train=x_train.toarray()
        vocabulary=vectorizer.vocabulary_
    
        vectorizer = CountVectorizer(
                                     decode_error='ignore',
                                     strip_accents='ascii',
                                     vocabulary=vocabulary,
                                     stop_words='english',
                                     max_df=1.0,
                                     min_df=1 )
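        print(vectorizer)
        # The vocabulary is fixed to the training vocabulary, so nothing is learned here;
        # transform() alone maps the test data onto the same columns
        x_test=vectorizer.transform(x_test)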
        x_test=x_test.toarray()
    
        return x_train, x_test, y_train, y_test
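
The same idea can be sketched on a toy data set, assuming Python 3 and a recent scikit-learn. Here a single `CountVectorizer` is fitted on the training texts and only `transform` is called on the test texts, which is an alternative to building a second vectorizer with `vocabulary=`. The sample texts and the names `train_texts` and `test_texts` are made up for illustration and do not come from the original post.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data, not from the original post
train_texts = ["free money now", "meeting at noon", "free meeting"]
test_texts = ["free lunch at noon", "money money"]

vectorizer = CountVectorizer()

# fit_transform learns the vocabulary from the training texts only
x_train = vectorizer.fit_transform(train_texts).toarray()

# transform reuses that vocabulary; words never seen in training are dropped
x_test = vectorizer.transform(test_texts).toarray()

vocabulary = vectorizer.vocabulary_
print(sorted(vocabulary))
print(x_train.shape, x_test.shape)
print(x_test)
```

Both matrices end up with one column per training word, so a classifier trained on `x_train` can be applied to `x_test` directly. The word "lunch" appears only in the test texts and therefore gets no column.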

    Bag-of-words model example:

    >>> from sklearn.feature_extraction.text import CountVectorizer
    >>> vectorizer = CountVectorizer()
    >>> corpus = [
    ...     'This is the first document.',
    ...     'This is the second second document.',
    ...     'And the third one.',
    ...     'Is this the first document?',
    ... ]
    >>> X = vectorizer.fit_transform(corpus)
    >>> X                              
    <4x9 sparse matrix of type '<... 'numpy.int64'>'
        with 19 stored elements in Compressed Sparse ... format>
    

    The default configuration tokenizes the string by extracting words of at least 2 letters. The specific function that does this step can be requested explicitly:

    >>> analyze = vectorizer.build_analyzer()
    >>> analyze("This is a text document to analyze.") == (
    ...     ['this', 'is', 'text', 'document', 'to', 'analyze'])
    True
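
The default tokenization can be approximated with the standard library alone. The sketch below assumes the default settings of `CountVectorizer`: the text is lowercased and the token pattern is `(?u)\b\w\w+\b`. The function name `simple_analyze` is hypothetical, and the real analyzer also handles options such as stop words and n-grams that are left out here.

```python
import re

# Default token pattern of CountVectorizer: runs of two or more word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def simple_analyze(text):
    """Lowercase the text, then extract tokens of at least two characters."""
    return TOKEN_PATTERN.findall(text.lower())

tokens = simple_analyze("This is a text document to analyze.")
print(tokens)
```

The single-letter word "a" is dropped because the pattern requires at least two word characters, and punctuation acts only as a separator.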
    

    Each term found by the analyzer during the fit is assigned a unique integer index corresponding to a column in the resulting matrix. This interpretation of the columns can be retrieved as follows:

    >>> list(vectorizer.get_feature_names_out()) == (
    ...     ['and', 'document', 'first', 'is', 'one',
    ...      'second', 'the', 'third', 'this'])
    True
    
    >>> X.toarray()           
    array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
           [0, 1, 0, 1, 0, 2, 1, 0, 1],
           [1, 0, 0, 0, 1, 0, 1, 1, 0],
           [0, 1, 1, 1, 0, 0, 1, 0, 1]]...)
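
To make the link between terms and columns concrete, the count matrix can be rebuilt by hand with the standard library. This is a minimal sketch and not how scikit-learn is implemented internally: it assumes the default lowercase tokenization and assigns column indices in alphabetical order, which matches the feature names listed earlier. The names `tokenize`, `column_of` and `matrix` are illustrative.

```python
import re
from collections import Counter

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

def tokenize(text):
    # Lowercase, then keep tokens of at least two word characters
    return re.findall(r"(?u)\b\w\w+\b", text.lower())

tokenized = [tokenize(doc) for doc in corpus]

# One column per distinct term, in alphabetical order
feature_names = sorted({tok for doc in tokenized for tok in doc})
column_of = {term: i for i, term in enumerate(feature_names)}

# Each row holds the term counts of one document
matrix = []
for doc in tokenized:
    counts = Counter(doc)
    matrix.append([counts[term] for term in feature_names])

# A document made only of unseen words maps to an all-zero row
new_counts = Counter(tokenize("Something completely new."))
new_row = [new_counts[term] for term in feature_names]

print(feature_names)
for row in matrix:
    print(row)
print(new_row)
```

The value 2 in the second row sits in the column for "second", which occurs twice in that sentence. Words that were not in the corpus have no column, so they cannot contribute to any later transform.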
    
  • Original post: https://www.cnblogs.com/bonelee/p/8426002.html