  • Bag-of-words model: the test data is vectorized with the same vocabulary as the training data (see the function below and the sketch that follows it)

    from sklearn.feature_extraction.text import CountVectorizer

    def get_features_by_wordbag():
        global max_features
        x_train, x_test, y_train, y_test = load_all_files()

        # Fit the bag-of-words vocabulary on the training data only.
        vectorizer = CountVectorizer(
                                     decode_error='ignore',
                                     strip_accents='ascii',
                                     max_features=max_features,
                                     stop_words='english',
                                     max_df=1.0,
                                     min_df=1)
        print(vectorizer)
        x_train = vectorizer.fit_transform(x_train)
        x_train = x_train.toarray()
        vocabulary = vectorizer.vocabulary_

        # Build the test vectorizer from the training vocabulary so the
        # test matrix has exactly the same columns as the training matrix.
        vectorizer = CountVectorizer(
                                     decode_error='ignore',
                                     strip_accents='ascii',
                                     vocabulary=vocabulary,
                                     stop_words='english',
                                     max_df=1.0,
                                     min_df=1)
        print(vectorizer)
        x_test = vectorizer.transform(x_test)
        x_test = x_test.toarray()

        return x_train, x_test, y_train, y_test
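
    The same effect can be achieved more directly by keeping a single fitted vectorizer and calling its transform method on the test data. A minimal sketch, using toy in-memory document lists in place of load_all_files() (the sample texts and max_features value are illustrative):

    from sklearn.feature_extraction.text import CountVectorizer

    # Toy stand-ins for the train/test text that load_all_files() would return.
    train_docs = ['free offer click now', 'meeting scheduled for monday', 'claim your free prize now']
    test_docs = ['free prize meeting', 'words never seen before']

    vectorizer = CountVectorizer(stop_words='english', max_features=5000)
    x_train = vectorizer.fit_transform(train_docs)  # learns the vocabulary from the training data
    x_test = vectorizer.transform(test_docs)        # reuses that vocabulary for the test data

    print(x_train.shape, x_test.shape)  # both matrices have the same number of columns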

    Bag-of-words model example:

    >>> from sklearn.feature_extraction.text import CountVectorizer
    >>> vectorizer = CountVectorizer()
    >>> corpus = [
    ...     'This is the first document.',
    ...     'This is the second second document.',
    ...     'And the third one.',
    ...     'Is this the first document?',
    ... ]
    >>> X = vectorizer.fit_transform(corpus)
    >>> X                              
    <4x9 sparse matrix of type '<... 'numpy.int64'>'
        with 19 stored elements in Compressed Sparse ... format>
    

    The default configuration tokenizes the string by extracting words of at least 2 letters. The specific function that does this step can be requested explicitly:

    >>> analyze = vectorizer.build_analyzer()
    >>> analyze("This is a text document to analyze.") == (
    ...     ['this', 'is', 'text', 'document', 'to', 'analyze'])
    True
    

    Each term found by the analyzer during the fit is assigned a unique integer index corresponding to a column in the resulting matrix. This interpretation of the columns can be retrieved as follows:

    >>> vectorizer.get_feature_names() == (
    ...     ['and', 'document', 'first', 'is', 'one',
    ...      'second', 'the', 'third', 'this'])
    True
    
    >>> X.toarray()           
    array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
           [0, 1, 0, 1, 0, 2, 1, 0, 1],
           [1, 0, 0, 0, 1, 0, 1, 1, 0],
           [0, 1, 1, 1, 0, 0, 1, 0, 1]]...)
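
    Because the column mapping is fixed when the vectorizer is fitted, words that did not appear in the training corpus are simply ignored when later documents are transformed; this is exactly why the test data above reuses the training vocabulary. A short continuation of the doctest (the lookup term and sample sentence are illustrative):

    >>> vectorizer.vocabulary_.get('document')
    1
    >>> vectorizer.transform(['Something completely new.']).toarray()
    array([[0, 0, 0, 0, 0, 0, 0, 0, 0]]...)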
    
  • Original article: https://www.cnblogs.com/bonelee/p/8426002.html