zoukankan      html  css  js  c++  java
  • 词袋模型 测试数据使用和训练数据一样的词汇表

    def get_features_by_wordbag():
        global max_features
        x_train, x_test, y_train, y_test=load_all_files()
    
        vectorizer = CountVectorizer(
                                     decode_error='ignore',
                                     strip_accents='ascii',
                                     max_features=max_features,
                                     stop_words='english',
                                     max_df=1.0,
                                     min_df=1 )
        print vectorizer
        x_train=vectorizer.fit_transform(x_train)
        x_train=x_train.toarray()
        vocabulary=vectorizer.vocabulary_
    
        vectorizer = CountVectorizer(
                                     decode_error='ignore',
                                     strip_accents='ascii',
                                     vocabulary=vocabulary,
                                     stop_words='english',
                                     max_df=1.0,
                                     min_df=1 )
        print vectorizer
        x_test=vectorizer.fit_transform(x_test)
        x_test=x_test.toarray()
    
        return x_train, x_test, y_train, y_test

     词袋模型示例:

    >>> corpus = [
    ...     'This is the first document.',
    ...     'This is the second second document.',
    ...     'And the third one.',
    ...     'Is this the first document?',
    ... ]
    >>> X = vectorizer.fit_transform(corpus)
    >>> X                              
    <4x9 sparse matrix of type '<... 'numpy.int64'>'
        with 19 stored elements in Compressed Sparse ... format>
    

    The default configuration tokenizes the string by extracting words of at least 2 letters. The specific function that does this step can be requested explicitly:

    >>>
    >>> analyze = vectorizer.build_analyzer()
    >>> analyze("This is a text document to analyze.") == (
    ...     ['this', 'is', 'text', 'document', 'to', 'analyze'])
    True
    

    Each term found by the analyzer during the fit is assigned a unique integer index corresponding to a column in the resulting matrix. This interpretation of the columns can be retrieved as follows:

    >>>
    >>> vectorizer.get_feature_names() == (
    ...     ['and', 'document', 'first', 'is', 'one',
    ...      'second', 'the', 'third', 'this'])
    True
    
    >>> X.toarray()           
    array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
           [0, 1, 0, 1, 0, 2, 1, 0, 1],
           [1, 0, 0, 0, 1, 0, 1, 1, 0],
           [0, 1, 1, 1, 0, 0, 1, 0, 1]]...)
    
  • 相关阅读:
    c++ 编译时检测结构体大小的的宏定义写法
    文本格式转换的网站,云转换
    chm格式文件,win7下用c:/windows/hh.exe打开
    visual paradigm 自动解析代码生成 UML图
    proxifiler 代理神器
    linux下设置 git ssh 代理
    一直出现 Enter passphrase for key '/root/.ssh/gitkey12.pub'
    connect-proxy rpm文件的安装
    [转] ssh免密码登录服务器
    [转] 公司局域网中代码访问 github.com
  • 原文地址:https://www.cnblogs.com/bonelee/p/8426002.html
Copyright © 2011-2022 走看看