  • Classification of text documents: using a MLComp dataset

    Note: the original code is at http://scikit-learn.org/stable/auto_examples/text/mlcomp_sparse_document_classification.html

    The output of a run is:

    Loading 20 newsgroups training set... 
    20 newsgroups dataset for document classification (http://people.csail.mit.edu/jrennie/20Newsgroups)
    13180 documents
    20 categories
    Extracting features from the dataset using a sparse vectorizer
    done in 139.231000s
    n_samples: 13180, n_features: 130274
    Loading 20 newsgroups test set... 
    done in 0.000000s
    Predicting the labels of the test set...
    5648 documents
    20 categories
    Extracting features from the dataset using the same vectorizer
    done in 7.082000s
    n_samples: 5648, n_features: 130274
    Testbenching a linear classifier...
    parameters: {'penalty': 'l2', 'loss': 'hinge', 'alpha': 1e-05, 'fit_intercept': True, 'n_iter': 50}
    done in 22.012000s
    Percentage of non zeros coef: 30.074190
    Predicting the outcomes of the testing set
    done in 0.172000s
    Classification report on test set for classifier:
    SGDClassifier(alpha=1e-05, average=False, class_weight=None, epsilon=0.1,
           eta0=0.0, fit_intercept=True, l1_ratio=0.15,
           learning_rate='optimal', loss='hinge', n_iter=50, n_jobs=1,
           penalty='l2', power_t=0.5, random_state=None, shuffle=True,
           verbose=0, warm_start=False)
    
                              precision    recall  f1-score   support
    
                 alt.atheism       0.95      0.93      0.94       245
               comp.graphics       0.85      0.91      0.88       298
     comp.os.ms-windows.misc       0.88      0.88      0.88       292
    comp.sys.ibm.pc.hardware       0.82      0.80      0.81       301
       comp.sys.mac.hardware       0.90      0.92      0.91       256
              comp.windows.x       0.92      0.88      0.90       297
                misc.forsale       0.87      0.89      0.88       290
                   rec.autos       0.93      0.94      0.94       324
             rec.motorcycles       0.97      0.97      0.97       294
          rec.sport.baseball       0.97      0.97      0.97       315
            rec.sport.hockey       0.98      0.99      0.99       302
                   sci.crypt       0.97      0.96      0.96       297
             sci.electronics       0.87      0.89      0.88       313
                     sci.med       0.97      0.97      0.97       277
                   sci.space       0.97      0.97      0.97       305
      soc.religion.christian       0.95      0.96      0.95       293
          talk.politics.guns       0.94      0.94      0.94       246
       talk.politics.mideast       0.97      0.99      0.98       296
          talk.politics.misc       0.96      0.92      0.94       236
          talk.religion.misc       0.89      0.84      0.86       171
    
                 avg / total       0.93      0.93      0.93      5648
    
    Confusion matrix:
    [[227   0   0   0   0   0   0   0   0   0   0   1   2   1   1   1   0   1
        0  11]
     [  0 271   3   8   2   5   2   0   0   1   0   0   3   1   1   0   0   1
        0   0]
     [  0   7 256  14   5   6   1   0   0   0   0   0   2   0   1   0   0   0
        0   0]
     [  1   8  12 240   9   3  12   2   0   0   0   1  12   0   0   1   0   0
        0   0]
     [  0   1   3   6 235   2   4   0   0   0   0   1   3   0   1   0   0   0
        0   0]
     [  0  17   9   4   0 260   0   0   1   1   0   0   2   0   2   0   1   0
        0   0]
     [  0   1   3   7   3   0 257   7   2   0   0   1   8   0   1   0   0   0
        0   0]
     [  0   0   0   2   1   0   5 305   2   3   0   0   4   1   0   0   1   0
        0   0]
     [  0   0   0   0   1   0   3   3 285   0   0   0   1   0   0   1   0   0
        0   0]
     [  0   0   0   0   0   0   3   2   0 305   2   1   1   0   0   0   0   0
        1   0]
     [  0   0   0   0   0   0   1   0   1   0 300   0   0   0   0   0   0   0
        0   0]
     [  0   0   1   1   0   2   0   1   0   0   0 284   0   1   1   0   2   2
        1   1]
     [  0   2   2  10   2   2   6   5   1   0   1   1 279   1   1   0   0   0
        0   0]
     [  0   3   0   0   1   1   1   0   0   0   0   0   0 269   0   1   1   0
        0   0]
     [  0   5   0   0   1   0   0   0   0   0   2   0   1   0 295   0   0   0
        1   0]
     [  1   1   1   0   0   1   0   1   0   0   0   0   0   1   1 282   1   0
        0   3]
     [  0   0   1   0   0   0   0   0   1   3   0   0   1   0   0   1 232   1
        5   1]
     [  0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   2   0 293
        0   0]
     [  0   2   0   0   0   0   2   0   0   1   0   1   0   1   0   0   7   4
      216   2]
     [ 11   0   0   0   0   0   0   0   0   0   0   1   0   2   0   9   2   1
        2 143]]
    Testbenching a MultinomialNB classifier...
    parameters: {'alpha': 0.01}
    done in 0.608000s
    Percentage of non zeros coef: 100.000000
    Predicting the outcomes of the testing set
    done in 0.203000s
    Classification report on test set for classifier:
    MultinomialNB(alpha=0.01, class_prior=None, fit_prior=True)
    
                              precision    recall  f1-score   support
    
                 alt.atheism       0.90      0.92      0.91       245
               comp.graphics       0.81      0.89      0.85       298
     comp.os.ms-windows.misc       0.87      0.83      0.85       292
    comp.sys.ibm.pc.hardware       0.82      0.83      0.83       301
       comp.sys.mac.hardware       0.90      0.92      0.91       256
              comp.windows.x       0.90      0.89      0.89       297
                misc.forsale       0.90      0.84      0.87       290
                   rec.autos       0.93      0.94      0.93       324
             rec.motorcycles       0.98      0.97      0.97       294
          rec.sport.baseball       0.97      0.97      0.97       315
            rec.sport.hockey       0.97      0.99      0.98       302
                   sci.crypt       0.95      0.95      0.95       297
             sci.electronics       0.90      0.86      0.88       313
                     sci.med       0.97      0.96      0.97       277
                   sci.space       0.95      0.97      0.96       305
      soc.religion.christian       0.91      0.97      0.94       293
          talk.politics.guns       0.89      0.96      0.93       246
       talk.politics.mideast       0.95      0.98      0.97       296
          talk.politics.misc       0.93      0.87      0.90       236
          talk.religion.misc       0.92      0.74      0.82       171
    
                 avg / total       0.92      0.92      0.92      5648
    
    Confusion matrix:
    [[226   0   0   0   0   0   0   0   0   1   0   0   0   0   2   7   0   0
        0   9]
     [  1 266   7   4   1   6   2   2   0   0   0   3   4   1   1   0   0   0
        0   0]
     [  0  11 243  22   4   7   1   0   0   0   0   1   2   0   0   0   0   0
        1   0]
     [  0   7  12 250   8   4   9   0   0   1   1   0   9   0   0   0   0   0
        0   0]
     [  0   3   3   5 235   2   3   1   0   0   0   2   1   0   1   0   0   0
        0   0]
     [  0  19   5   3   2 263   0   0   0   0   0   1   0   1   1   0   2   0
        0   0]
     [  0   1   4   9   3   1 243   9   2   3   1   0   8   0   0   0   2   2
        2   0]
     [  0   0   0   1   1   0   5 304   1   2   0   0   3   2   3   1   1   0
        0   0]
     [  0   0   0   0   0   2   2   3 285   0   0   0   1   0   0   0   0   0
        0   1]
     [  0   1   0   0   0   1   1   3   0 304   5   0   0   0   0   0   0   0
        0   0]
     [  0   0   0   0   0   0   0   0   1   2 299   0   0   0   0   0   0   0
        0   0]
     [  0   2   2   1   0   1   2   0   0   0   0 283   1   0   0   0   2   1
        2   0]
     [  0  11   1   9   3   1   3   5   1   0   1   4 270   1   3   0   0   0
        0   0]
     [  0   2   0   1   1   1   0   0   0   0   0   1   0 266   2   1   0   0
        2   0]
     [  0   2   0   0   1   0   0   0   0   0   0   2   1   1 296   0   1   1
        0   0]
     [  3   1   0   0   0   0   0   0   0   0   1   0   0   2   0 283   0   1
        2   0]
     [  1   0   1   0   0   0   0   0   1   0   0   1   0   0   0   0 237   1
        3   1]
     [  1   0   0   0   0   1   0   0   0   0   0   0   0   0   0   3   0 291
        0   0]
     [  1   1   0   0   1   1   0   1   0   0   0   0   0   0   1   1  17   6
      206   0]
     [ 18   1   0   0   0   0   0   0   0   1   0   0   0   0   0  14   4   2
        4 127]]
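    The per-class figures in the classification reports can be recovered from the confusion matrices, whose rows are true classes and whose columns are predicted classes. As a small worked check (the variable names are illustrative, not from the original example), the three numbers below are read off the SGDClassifier matrix above for alt.atheism: the diagonal entry of row 0, the sum of row 0, and the sum of column 0.

```python
# Values read from the SGDClassifier confusion matrix above (class alt.atheism).
true_positives = 227   # diagonal entry of row 0
row_total = 245        # sum of row 0: documents whose true class is alt.atheism (the "support")
column_total = 240     # sum of column 0: documents predicted as alt.atheism

precision = true_positives / column_total   # share of predictions for this class that are right
recall = true_positives / row_total         # share of this class's documents that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print("precision=%.2f recall=%.2f f1=%.2f" % (precision, recall, f1))
# prints: precision=0.95 recall=0.93 f1=0.94
```

    These agree with the alt.atheism row of the first report (0.95, 0.93, 0.94, support 245).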

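    The steps are:

    I. Preprocessing

    1. Load the training set.

    2. Extract features from the training set with TfidfVectorizer, giving X_train; the labels y_train come from the dataset's targets.

    3. Load the test set.

    4. Extract features from the test set with the same TfidfVectorizer, giving X_test; the labels y_test come from the test set's targets.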
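    The preprocessing steps can be sketched as follows. This is a minimal illustration, not the original script: the MLComp loader used by the original example is, as far as I know, no longer shipped in recent scikit-learn releases, so a tiny in-memory corpus stands in for the 20 newsgroups data. The documents, labels, and the names train_docs and test_docs are assumptions made up for the sketch.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the MLComp 20 newsgroups files.
train_docs = [
    "the goalie stopped the puck in the hockey game",
    "the hockey team won the game in overtime",
    "the rocket reached orbit around the moon",
    "the space shuttle launch was delayed by weather",
]
y_train = [0, 0, 1, 1]   # 0 = rec.sport.hockey, 1 = sci.space
test_docs = [
    "the hockey game went to overtime",
    "the shuttle reached orbit",
]
y_test = [0, 1]

vectorizer = TfidfVectorizer()

# Learn the vocabulary and IDF weights on the training set only.
X_train = vectorizer.fit_transform(train_docs)
print("n_samples: %d, n_features: %d" % X_train.shape)

# Reuse the fitted vectorizer on the test set so both matrices share columns.
X_test = vectorizer.transform(test_docs)
print("n_samples: %d, n_features: %d" % X_test.shape)
```

    Calling transform rather than fit_transform on the test set is what keeps n_features identical for both sets, as in the log above (130274 in both cases).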

    II. Defining the benchmark for classifiers

    5. Train: clf = clf_class(**params).fit(X_train, y_train)

    6. Predict: pred = clf.predict(X_test)

    7. Classification report on the test set: print(classification_report(y_test, pred, target_names=news_test.target_names))

    8. Confusion matrix: cm = confusion_matrix(y_test, pred)
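    Steps 5 to 8 fit naturally into one helper function. The sketch below is an approximation under stated assumptions rather than the original code: the toy corpus and target names are invented, the function signature is illustrative, and max_iter with tol=None replaces the n_iter parameter shown in the log, which newer scikit-learn releases no longer accept.

```python
from time import time

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import classification_report, confusion_matrix


def benchmark(clf_class, params, X_train, y_train, X_test, y_test, target_names):
    """Train one classifier, then report its performance on the test set."""
    t0 = time()
    clf = clf_class(**params).fit(X_train, y_train)      # step 5: train
    print("done in %fs" % (time() - t0))

    # Not every estimator exposes coef_ in every release, hence the guard.
    if hasattr(clf, "coef_"):
        print("Percentage of non zeros coef: %f" % (np.mean(clf.coef_ != 0) * 100))

    pred = clf.predict(X_test)                            # step 6: predict
    print(classification_report(y_test, pred, target_names=target_names))  # step 7
    cm = confusion_matrix(y_test, pred)                   # step 8
    print(cm)
    return clf, pred, cm


# Hypothetical toy data (assumption): two classes, a few short documents.
train_docs = [
    "the goalie stopped the puck in the hockey game",
    "the hockey team won the game in overtime",
    "the rocket reached orbit around the moon",
    "the space shuttle launch was delayed by weather",
]
y_train = [0, 0, 1, 1]
test_docs = ["the hockey game went to overtime", "the shuttle reached orbit"]
y_test = [0, 1]
target_names = ["rec.sport.hockey", "sci.space"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)
X_test = vectorizer.transform(test_docs)

sgd_params = {"loss": "hinge", "penalty": "l2", "alpha": 1e-5,
              "max_iter": 50, "tol": None, "random_state": 0}
clf, pred, cm = benchmark(SGDClassifier, sgd_params,
                          X_train, y_train, X_test, y_test, target_names)
```

    Passing the class and its parameters separately, instead of a ready-made instance, lets the same function build and time any classifier.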

    III. Training

    9. Run the benchmark with two classifiers, SGDClassifier and MultinomialNB.
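    A compact way to run both classifiers is to loop over (class, parameters) pairs. The parameter values below are taken from the log above, except that n_iter is written as max_iter with tol=None for current scikit-learn, and random_state is added for repeatability. The toy corpus and the confusion dictionary are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data (assumption) in place of the 20 newsgroups sets.
train_docs = [
    "the goalie stopped the puck in the hockey game",
    "the hockey team won the game in overtime",
    "the rocket reached orbit around the moon",
    "the space shuttle launch was delayed by weather",
]
y_train = [0, 0, 1, 1]
test_docs = ["the hockey game went to overtime", "the shuttle reached orbit"]
y_test = [0, 1]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)
X_test = vectorizer.transform(test_docs)

# One entry per classifier: the class itself plus its keyword arguments.
candidates = [
    (SGDClassifier, {"loss": "hinge", "penalty": "l2", "alpha": 1e-5,
                     "fit_intercept": True, "max_iter": 50, "tol": None,
                     "random_state": 0}),
    (MultinomialNB, {"alpha": 0.01}),
]

confusion = {}
for clf_class, params in candidates:
    clf = clf_class(**params).fit(X_train, y_train)
    pred = clf.predict(X_test)
    confusion[clf_class.__name__] = confusion_matrix(y_test, pred)
    print(clf_class.__name__, "accuracy:", accuracy_score(y_test, pred))
```

    TF-IDF features are non-negative, which is why the same matrices can be fed to MultinomialNB without further conversion.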

     

  • Original post: https://www.cnblogs.com/gui0901/p/4456267.html