zoukankan      html  css  js  c++  java
  • Classification of text documents: using a MLComp dataset

    注:原文代码链接http://scikit-learn.org/stable/auto_examples/text/mlcomp_sparse_document_classification.html

    运行结果为:

    Loading 20 newsgroups training set... 
    20 newsgroups dataset for document classification (http://people.csail.mit.edu/jrennie/20Newsgroups)
    13180 documents
    20 categories
    Extracting features from the dataset using a sparse vectorizer
    done in 139.231000s
    n_samples: 13180, n_features: 130274
    Loading 20 newsgroups test set... 
    done in 0.000000s
    Predicting the labels of the test set...
    5648 documents
    20 categories
    Extracting features from the dataset using the same vectorizer
    done in 7.082000s
    n_samples: 5648, n_features: 130274
    Testbenching a linear classifier...
    parameters: {'penalty': 'l2', 'loss': 'hinge', 'alpha': 1e-05, 'fit_intercept': True, 'n_iter': 50}
    done in 22.012000s
    Percentage of non zeros coef: 30.074190
    Predicting the outcomes of the testing set
    done in 0.172000s
    Classification report on test set for classifier:
    SGDClassifier(alpha=1e-05, average=False, class_weight=None, epsilon=0.1,
           eta0=0.0, fit_intercept=True, l1_ratio=0.15,
           learning_rate='optimal', loss='hinge', n_iter=50, n_jobs=1,
           penalty='l2', power_t=0.5, random_state=None, shuffle=True,
           verbose=0, warm_start=False)
    
                              precision    recall  f1-score   support
    
                 alt.atheism       0.95      0.93      0.94       245
               comp.graphics       0.85      0.91      0.88       298
     comp.os.ms-windows.misc       0.88      0.88      0.88       292
    comp.sys.ibm.pc.hardware       0.82      0.80      0.81       301
       comp.sys.mac.hardware       0.90      0.92      0.91       256
              comp.windows.x       0.92      0.88      0.90       297
                misc.forsale       0.87      0.89      0.88       290
                   rec.autos       0.93      0.94      0.94       324
             rec.motorcycles       0.97      0.97      0.97       294
          rec.sport.baseball       0.97      0.97      0.97       315
            rec.sport.hockey       0.98      0.99      0.99       302
                   sci.crypt       0.97      0.96      0.96       297
             sci.electronics       0.87      0.89      0.88       313
                     sci.med       0.97      0.97      0.97       277
                   sci.space       0.97      0.97      0.97       305
      soc.religion.christian       0.95      0.96      0.95       293
          talk.politics.guns       0.94      0.94      0.94       246
       talk.politics.mideast       0.97      0.99      0.98       296
          talk.politics.misc       0.96      0.92      0.94       236
          talk.religion.misc       0.89      0.84      0.86       171
    
                 avg / total       0.93      0.93      0.93      5648
    
    Confusion matrix:
    [[227   0   0   0   0   0   0   0   0   0   0   1   2   1   1   1   0   1
        0  11]
     [  0 271   3   8   2   5   2   0   0   1   0   0   3   1   1   0   0   1
        0   0]
     [  0   7 256  14   5   6   1   0   0   0   0   0   2   0   1   0   0   0
        0   0]
     [  1   8  12 240   9   3  12   2   0   0   0   1  12   0   0   1   0   0
        0   0]
     [  0   1   3   6 235   2   4   0   0   0   0   1   3   0   1   0   0   0
        0   0]
     [  0  17   9   4   0 260   0   0   1   1   0   0   2   0   2   0   1   0
        0   0]
     [  0   1   3   7   3   0 257   7   2   0   0   1   8   0   1   0   0   0
        0   0]
     [  0   0   0   2   1   0   5 305   2   3   0   0   4   1   0   0   1   0
        0   0]
     [  0   0   0   0   1   0   3   3 285   0   0   0   1   0   0   1   0   0
        0   0]
     [  0   0   0   0   0   0   3   2   0 305   2   1   1   0   0   0   0   0
        1   0]
     [  0   0   0   0   0   0   1   0   1   0 300   0   0   0   0   0   0   0
        0   0]
     [  0   0   1   1   0   2   0   1   0   0   0 284   0   1   1   0   2   2
        1   1]
     [  0   2   2  10   2   2   6   5   1   0   1   1 279   1   1   0   0   0
        0   0]
     [  0   3   0   0   1   1   1   0   0   0   0   0   0 269   0   1   1   0
        0   0]
     [  0   5   0   0   1   0   0   0   0   0   2   0   1   0 295   0   0   0
        1   0]
     [  1   1   1   0   0   1   0   1   0   0   0   0   0   1   1 282   1   0
        0   3]
     [  0   0   1   0   0   0   0   0   1   3   0   0   1   0   0   1 232   1
        5   1]
     [  0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   2   0 293
        0   0]
     [  0   2   0   0   0   0   2   0   0   1   0   1   0   1   0   0   7   4
      216   2]
     [ 11   0   0   0   0   0   0   0   0   0   0   1   0   2   0   9   2   1
        2 143]]
    Testbenching a MultinomialNB classifier...
    parameters: {'alpha': 0.01}
    done in 0.608000s
    Percentage of non zeros coef: 100.000000
    Predicting the outcomes of the testing set
    done in 0.203000s
    Classification report on test set for classifier:
    MultinomialNB(alpha=0.01, class_prior=None, fit_prior=True)
    
                              precision    recall  f1-score   support
    
                 alt.atheism       0.90      0.92      0.91       245
               comp.graphics       0.81      0.89      0.85       298
     comp.os.ms-windows.misc       0.87      0.83      0.85       292
    comp.sys.ibm.pc.hardware       0.82      0.83      0.83       301
       comp.sys.mac.hardware       0.90      0.92      0.91       256
              comp.windows.x       0.90      0.89      0.89       297
                misc.forsale       0.90      0.84      0.87       290
                   rec.autos       0.93      0.94      0.93       324
             rec.motorcycles       0.98      0.97      0.97       294
          rec.sport.baseball       0.97      0.97      0.97       315
            rec.sport.hockey       0.97      0.99      0.98       302
                   sci.crypt       0.95      0.95      0.95       297
             sci.electronics       0.90      0.86      0.88       313
                     sci.med       0.97      0.96      0.97       277
                   sci.space       0.95      0.97      0.96       305
      soc.religion.christian       0.91      0.97      0.94       293
          talk.politics.guns       0.89      0.96      0.93       246
       talk.politics.mideast       0.95      0.98      0.97       296
          talk.politics.misc       0.93      0.87      0.90       236
          talk.religion.misc       0.92      0.74      0.82       171
    
                 avg / total       0.92      0.92      0.92      5648
    
    Confusion matrix:
    [[226   0   0   0   0   0   0   0   0   1   0   0   0   0   2   7   0   0
        0   9]
     [  1 266   7   4   1   6   2   2   0   0   0   3   4   1   1   0   0   0
        0   0]
     [  0  11 243  22   4   7   1   0   0   0   0   1   2   0   0   0   0   0
        1   0]
     [  0   7  12 250   8   4   9   0   0   1   1   0   9   0   0   0   0   0
        0   0]
     [  0   3   3   5 235   2   3   1   0   0   0   2   1   0   1   0   0   0
        0   0]
     [  0  19   5   3   2 263   0   0   0   0   0   1   0   1   1   0   2   0
        0   0]
     [  0   1   4   9   3   1 243   9   2   3   1   0   8   0   0   0   2   2
        2   0]
     [  0   0   0   1   1   0   5 304   1   2   0   0   3   2   3   1   1   0
        0   0]
     [  0   0   0   0   0   2   2   3 285   0   0   0   1   0   0   0   0   0
        0   1]
     [  0   1   0   0   0   1   1   3   0 304   5   0   0   0   0   0   0   0
        0   0]
     [  0   0   0   0   0   0   0   0   1   2 299   0   0   0   0   0   0   0
        0   0]
     [  0   2   2   1   0   1   2   0   0   0   0 283   1   0   0   0   2   1
        2   0]
     [  0  11   1   9   3   1   3   5   1   0   1   4 270   1   3   0   0   0
        0   0]
     [  0   2   0   1   1   1   0   0   0   0   0   1   0 266   2   1   0   0
        2   0]
     [  0   2   0   0   1   0   0   0   0   0   0   2   1   1 296   0   1   1
        0   0]
     [  3   1   0   0   0   0   0   0   0   0   1   0   0   2   0 283   0   1
        2   0]
     [  1   0   1   0   0   0   0   0   1   0   0   1   0   0   0   0 237   1
        3   1]
     [  1   0   0   0   0   1   0   0   0   0   0   0   0   0   0   3   0 291
        0   0]
     [  1   1   0   0   1   1   0   1   0   0   0   0   0   0   1   1  17   6
      206   0]
     [ 18   1   0   0   0   0   0   0   0   1   0   0   0   0   0  14   4   2
        4 127]]

     步骤为:

    一、preprocessing

    1.加载训练集(training set)

    2.训练集特征提取,用TfidfVectorizer,得到训练集上的x_train和y_train

    3.加载测试集(test set)

    4.测试集特征提取,用TfidfVectorizer得到测试集上的x_train和y_train

    二、定义Benchmark classifiers

    5.训练,clf = clf_class(**params).fit(X_train, y_train)

    6.测试,pred = clf.predict(X_test)

    7.测试集上分类报告,print(classification_report(y_test, pred,target_names=news_test.target_names))

    8.confusion matrix,cm = confusion_matrix(y_test, pred)

    三、训练

    9.调用两个分类器,SGDClassifier和MultinomialNB

     

  • 相关阅读:
    django模板语言之Template
    python基础知识回顾之字符串
    在命令行中创建Django项目
    Ubuntu如何开启SSH服务
    跨过Django的坑
    python基础知识回顾之元组
    python基础知识回顾之列表
    SQL Server登录方式
    k8s或者docker容器在安装服务之后,用systemctl启动会报错:connection: Operation not permitted
    记一次oracle视图查询失效的情况,ERROR at line 1: ORA-04045: errors during recompilation/revalidation of NC633.BB_API_BUDGET_EXEINFO ORA-16000: database open for read-only access
  • 原文地址:https://www.cnblogs.com/gui0901/p/4456267.html
Copyright © 2011-2022 走看看