  • Bag-of-words model: the test data uses the same vocabulary as the training data

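    from sklearn.feature_extraction.text import CountVectorizer

    # load_all_files() and max_features are defined elsewhere in the original script.
    def get_features_by_wordbag():
        global max_features
        x_train, x_test, y_train, y_test = load_all_files()

        # Learn the vocabulary from the training data only.
        vectorizer = CountVectorizer(
                                     decode_error='ignore',
                                     strip_accents='ascii',
                                     max_features=max_features,
                                     stop_words='english',
                                     max_df=1.0,
                                     min_df=1)
        print(vectorizer)
        x_train = vectorizer.fit_transform(x_train)
        x_train = x_train.toarray()
        vocabulary = vectorizer.vocabulary_

        # Build the test vectorizer on the training vocabulary so that
        # the test columns line up with the training columns.
        vectorizer = CountVectorizer(
                                     decode_error='ignore',
                                     strip_accents='ascii',
                                     vocabulary=vocabulary,
                                     stop_words='english',
                                     max_df=1.0,
                                     min_df=1)
        print(vectorizer)
        # The vocabulary is fixed, so transform() is enough; nothing is refit.
        x_test = vectorizer.transform(x_test)
        x_test = x_test.toarray()

        return x_train, x_test, y_train, y_test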
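
The idea of reusing the training vocabulary can be sketched without scikit-learn. The following is a minimal standard-library illustration, not the library's implementation: the helper names `tokenize`, `fit_vocabulary`, and `transform` are hypothetical, and the two tiny corpora are made up for the example. Words that occur only in the test data have no column and are dropped, so the train and test vectors have the same width and the same column meaning.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase the text and keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())


def fit_vocabulary(docs):
    # Learn a term -> column index mapping from the training documents only.
    terms = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {term: idx for idx, term in enumerate(terms)}


def transform(docs, vocabulary):
    # Count only terms that are in the vocabulary; unseen terms are ignored.
    ordered_terms = sorted(vocabulary, key=vocabulary.get)
    rows = []
    for doc in docs:
        counts = Counter(tok for tok in tokenize(doc) if tok in vocabulary)
        rows.append([counts[term] for term in ordered_terms])
    return rows


train_docs = ["free money now", "meeting at noon"]  # made-up training data
test_docs = ["free lunch at noon noon"]             # "lunch" is unseen

vocabulary = fit_vocabulary(train_docs)
x_train = transform(train_docs, vocabulary)
x_test = transform(test_docs, vocabulary)

print(sorted(vocabulary, key=vocabulary.get))
print(x_test)
```

Fitting a separate vocabulary on the test data would instead produce columns that mean something different from the training columns, which is why the function above passes the training vocabulary to the second vectorizer.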

     Bag-of-words example:

    >>> from sklearn.feature_extraction.text import CountVectorizer
    >>> vectorizer = CountVectorizer()
    >>> corpus = [
    ...     'This is the first document.',
    ...     'This is the second second document.',
    ...     'And the third one.',
    ...     'Is this the first document?',
    ... ]
    >>> X = vectorizer.fit_transform(corpus)
    >>> X                              
    <4x9 sparse matrix of type '<... 'numpy.int64'>'
        with 19 stored elements in Compressed Sparse ... format>
    

    The default configuration tokenizes the string by extracting words of at least 2 letters. The specific function that does this step can be requested explicitly:

    >>> analyze = vectorizer.build_analyzer()
    >>> analyze("This is a text document to analyze.") == (
    ...     ['this', 'is', 'text', 'document', 'to', 'analyze'])
    True
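
As a rough sketch of what the default analyzer does in this case, the snippet below lowercases the text and extracts tokens with the regular expression `(?u)\b\w\w+\b`, which is scikit-learn's documented default `token_pattern`. The function name `simple_analyze` is hypothetical, and options such as stop words, accent stripping, and n-grams are deliberately left out.

```python
import re

# Default token pattern documented for CountVectorizer:
# runs of two or more word characters, so single-letter words are dropped.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def simple_analyze(text):
    # Lowercase first, then extract the tokens.
    return TOKEN_PATTERN.findall(text.lower())


tokens = simple_analyze("This is a text document to analyze.")
print(tokens)
```

The single-letter word "a" and the trailing period do not match the pattern, which is why neither appears in the token list.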
    

    Each term found by the analyzer during the fit is assigned a unique integer index corresponding to a column in the resulting matrix. This interpretation of the columns can be retrieved as follows:

    >>> # scikit-learn >= 1.0; older releases used get_feature_names()
    >>> list(vectorizer.get_feature_names_out()) == (
    ...     ['and', 'document', 'first', 'is', 'one',
    ...      'second', 'the', 'third', 'this'])
    True
    
    >>> X.toarray()           
    array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
           [0, 1, 0, 1, 0, 2, 1, 0, 1],
           [1, 0, 0, 0, 1, 0, 1, 1, 0],
           [0, 1, 1, 1, 0, 0, 1, 0, 1]]...)
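
The mapping from terms to columns can be reproduced by hand to see where each number in the matrix comes from. This is a standard-library sketch under the assumption that terms are lowercased, tokenized as words of two or more characters, and indexed in sorted order, which matches the feature names shown above. The names `tokenize`, `feature_names`, and `matrix` are illustrative only.

```python
import re
from collections import Counter

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]


def tokenize(text):
    # Lowercase the text and keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())


# Each distinct term gets one column; columns follow sorted term order.
feature_names = sorted({tok for doc in corpus for tok in tokenize(doc)})
vocabulary = {term: idx for idx, term in enumerate(feature_names)}

# One row per document, one count per column.
matrix = []
for doc in corpus:
    counts = Counter(tokenize(doc))
    matrix.append([counts[term] for term in feature_names])

print(vocabulary)
for row in matrix:
    print(row)
```

Looking up `vocabulary['second']` gives the column that holds the 2 in the second row, since "second" occurs twice in that document.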
    
  • Original article: https://www.cnblogs.com/bonelee/p/8426002.html