  • Extracting bag-of-words features from text

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    vec = CountVectorizer(
        analyzer='word',       # tokenise by word
        max_features=4000,     # keep only the 4000 most frequent words
    )

    # Fit the bag-of-words vocabulary on the training texts
    vec.fit(x_train)

    classifier = MultinomialNB()

    # vec.transform(x_train) maps the training texts to a matrix of shape [n_samples, 4000]
    classifier.fit(vec.transform(x_train), y_train)
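
    To evaluate the trained model, the same fitted vectorizer must transform the test texts. Below is a minimal sketch, assuming held-out x_test/y_test arrays that are not shown in the original post:

    # Hypothetical evaluation step: x_test/y_test are assumed to be
    # held-out texts and labels from the same corpus as x_train/y_train.
    from sklearn.metrics import accuracy_score

    y_pred = classifier.predict(vec.transform(x_test))
    print('accuracy:', accuracy_score(y_test, y_pred))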

    # Additionally extract n-gram statistical features (unigrams through 4-grams)
    vec = CountVectorizer(
        analyzer='word',        # tokenise by word
        ngram_range=(1, 4),     # use ngrams of sizes 1 through 4
        max_features=20000,     # keep the most common 20000 ngrams
    )
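
    This richer vectorizer then replaces the unigram one in the same fit/transform steps. A sketch of the refit, reusing the x_train and y_train names from above:

    # Refit the vocabulary and the model with the n-gram features
    vec.fit(x_train)
    classifier = MultinomialNB()
    classifier.fit(vec.transform(x_train), y_train)  # matrix shape: [n_samples, 20000]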

    A more reliable way to validate the model is cross-validation, but cross-validation works best when the class distribution within each fold stays relatively balanced, so here we use StratifiedKFold.

    # StratifiedKFold now lives in sklearn.model_selection
    # (the old sklearn.cross_validation module was removed in scikit-learn 0.20)
    from sklearn.model_selection import StratifiedKFold

    # x is the training data, y the labels; with 5 folds each split uses train : test = 4 : 1
    stratified_k_fold = StratifiedKFold(n_splits=5, shuffle=True)
    for train_index, test_index in stratified_k_fold.split(x, y):
        X_train, X_test = x[train_index], x[test_index]
        y_train, y_test = y[train_index], y[test_index]
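
    Putting the pieces together, here is a minimal sketch of a full stratified cross-validation run for the n-gram Naive Bayes model above. The helper name stratified_cv_accuracy and the per-fold accuracy reporting are illustrative additions, and x/y are assumed to be NumPy arrays of texts and labels:

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import StratifiedKFold
    from sklearn.naive_bayes import MultinomialNB

    def stratified_cv_accuracy(x, y, n_splits=5):
        """Hypothetical helper: train and score the model on each stratified fold."""
        scores = []
        skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
        for train_index, test_index in skf.split(x, y):
            X_tr, X_te = x[train_index], x[test_index]
            y_tr, y_te = y[train_index], y[test_index]
            # The vectorizer is fitted on the training fold only,
            # so no vocabulary information leaks in from the test fold.
            vec = CountVectorizer(analyzer='word', ngram_range=(1, 4), max_features=20000)
            clf = MultinomialNB()
            clf.fit(vec.fit_transform(X_tr), y_tr)
            scores.append(clf.score(vec.transform(X_te), y_te))
        return np.mean(scores)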

  • Original post: https://www.cnblogs.com/yongfuxue/p/10118993.html