  • NLP (17): Classifying Emails with a DNN

    Original link: http://www.one2know.cn/nlp17/

    • Dataset
      The 20 Newsgroups dataset from scikit-learn: 18,846 emails in total (11,314 in the training set, 7,532 in the test set) across 20 categories.
    from sklearn.datasets import fetch_20newsgroups
    newsgroups_train = fetch_20newsgroups(subset='train')
    newsgroups_test = fetch_20newsgroups(subset='test')
    x_train = newsgroups_train.data
    x_test = newsgroups_test.data
    y_train = newsgroups_train.target
    y_test = newsgroups_test.target
    print('List of all 20 categories:')
    print(newsgroups_train.target_names,'\n')
    print('Sample Email:')
    print(x_train[0])
    print('Sample Target Category:')
    print(y_train[0])
    print(newsgroups_train.target_names[y_train[0]])
    

    Output:

    List of all 20 categories:
    ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc'] 
    
    Sample Email:
    From: lerxst@wam.umd.edu (where's my thing)
    Subject: WHAT car is this!?
    Nntp-Posting-Host: rac3.wam.umd.edu
    Organization: University of Maryland, College Park
    Lines: 15
    
     I was wondering if anyone out there could enlighten me on this car I saw
    the other day. It was a 2-door sports car, looked to be from the late 60s/
    early 70s. It was called a Bricklin. The doors were really small. In addition,
    the front bumper was separate from the rest of the body. This is 
    all I know. If anyone can tellme a model name, engine specs, years
    of production, where this car is made, history, or whatever info you
    have on this funky looking car, please e-mail.
    
    Thanks,
    - IL
       ---- brought to you by your neighborhood Lerxst ----
    
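    • Implementation steps
    1. Preprocessing
      1) Remove punctuation
      2) Tokenize
      3) Convert all words to lowercase
      4) Remove stop words
      5) Keep only words with at least 3 characters
      6) Stemming
      7) Part-of-speech (POS) tagging
      8) Lemmatization
    2. TF-IDF vectorization
    3. Training and testing the deep learning model
    4. Model evaluation and analysis of the results
    • Code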
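Step 2 is the only step that the code listing delegates entirely to a library call, so it may help to see the arithmetic once. The sketch below is a minimal pure-Python TF-IDF, assuming the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization, which is what scikit-learn documents as the default behaviour of `TfidfVectorizer`. The function name `tfidf` and the three toy documents are made up for illustration, and plain whitespace splitting stands in for the preprocessing of step 1.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, rows) where each row holds L2-normalised TF-IDF weights."""
    tokenized = [doc.split() for doc in docs]  # toy tokenizer: whitespace only
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency (assumed formula, see lead-in)
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)  # raw term frequency
        raw = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0  # avoid dividing by zero
        rows.append([w / norm for w in raw])
    return vocab, rows

# Hypothetical, already-preprocessed documents
docs = ["car engine car", "engine spec", "hockey game"]
vocab, rows = tfidf(docs)
print(vocab)
for row in rows:
    print([round(w, 3) for w in row])
```

In the first toy document "car" gets a larger weight than "engine": it occurs twice there, and it appears in fewer documents overall. Each row has unit length, which corresponds to the `norm='l2'` setting used in the listing.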
    from sklearn.datasets import fetch_20newsgroups
    newsgroups_train = fetch_20newsgroups(subset='train')
    newsgroups_test = fetch_20newsgroups(subset='test')
    x_train = newsgroups_train.data
    x_test = newsgroups_test.data
    y_train = newsgroups_train.target
    y_test = newsgroups_test.target
    # print('List of all 20 categories:')
    # print(newsgroups_train.target_names,'\n')
    # print('Sample Email:')
    # print(x_train[0])
    # print('Sample Target Category:')
    # print(y_train[0])
    # print(newsgroups_train.target_names[y_train[0]])
    
    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    import string
    import pandas as pd
    from nltk import  pos_tag
    from nltk.stem import PorterStemmer
    
    def preprocessing(text):
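        # Replace each punctuation character with a space, then split on whitespace and re-join with single spaces
        text2 = ' '.join(''.join([' ' if ch in string.punctuation else ch for ch in text]).split())
        # Tokenize: split into sentences, then into words
        tokens = [word for sent in nltk.sent_tokenize(text2) for word in nltk.word_tokenize(sent)]
        # Lowercase every token
        tokens = [word.lower() for word in tokens]
        stopwds = set(stopwords.words('english'))  # a set gives fast membership tests
        # Drop stop words and tokens shorter than 3 characters
        tokens = [token for token in tokens if token not in stopwds and len(token) >= 3]
        # Stemming
        stemmer = PorterStemmer()
        tokens = [stemmer.stem(word) for word in tokens]
        # POS tagging
        tagged_corpus = pos_tag(tokens)
        Noun_tags = ['NN','NNP','NNPS','NNS'] # common noun, proper noun, plural proper noun, plural common noun
        Verb_tags = ['VB','VBD','VBG','VBN','VBP','VBZ']
        # base form, past tense, gerund/present participle, past participle, non-3rd-person present, 3rd-person singular present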
        lemmatizer = WordNetLemmatizer()
        def prat_lemmatize(token,tag):
            if tag in Noun_tags:
                return lemmatizer.lemmatize(token,'n')
            elif tag in Verb_tags:
                return lemmatizer.lemmatize(token,'v')
            else:
                return lemmatizer.lemmatize(token,'n')
        pre_proc_text = ' '.join([prat_lemmatize(token,tag) for token,tag in tagged_corpus])
        return pre_proc_text
    
    # Preprocess the training and test sets
    x_train_preprocessed = []
    for i in x_train:
        x_train_preprocessed.append(preprocessing(i))
    x_test_preprocessed = []
    for i in x_test:
        x_test_preprocessed.append(preprocessing(i))
    
    # Compute the TF-IDF vector of each document
    from sklearn.feature_extraction.text import TfidfVectorizer
    vectorizer = TfidfVectorizer(min_df=2,ngram_range=(1,2),stop_words='english',
                                 max_features=10000,strip_accents='unicode',norm='l2')
    x_train_2 = vectorizer.fit_transform(x_train_preprocessed).todense() # sparse matrix => dense matrix
    x_test_2 = vectorizer.transform(x_test_preprocessed).todense()
    
    # Import the deep learning modules
    import numpy as np
    from keras.models import Sequential
    from keras.layers.core import Dense,Dropout,Activation
    from keras.optimizers import Adadelta,Adam,RMSprop
    from keras.utils import np_utils
    
    np.random.seed(0)
    nb_classes = 20
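    batch_size = 64 # batch size
    nb_epochs = 20 # number of training epochs
    
    # Convert the 20 class labels to one-hot vectors
    Y_train = np_utils.to_categorical(y_train,nb_classes)
    
    # Build the Keras model: 3 hidden layers with 1000, 500 and 50 units, 50% dropout after each, optimized with Adam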
    model = Sequential()
    model.add(Dense(1000,input_shape=(10000,)))
    model.add(Activation('relu'))
    model.add(Dropout(0.5))
    model.add(Dense(500))
    model.add(Activation('relu'))
    model.add(Dropout(0.5))
    model.add(Dense(50))
    model.add(Activation('relu'))
    model.add(Dropout(0.5))
    model.add(Dense(nb_classes))
    model.add(Activation('softmax'))
    model.compile(loss='categorical_crossentropy',optimizer='adam')
    # loss = categorical cross-entropy, optimizer = Adam
    print(model.summary())
    
    # Train the model
    model.fit(x_train_2,Y_train,batch_size=batch_size,epochs=nb_epochs,verbose=1)
    
    # Predict class labels for the training and test sets
    y_train_predclass = model.predict_classes(x_train_2,batch_size=batch_size)
    # Note: batch_size must be passed with '=', not '==' (which would pass the boolean True)
    y_test_predclass = model.predict_classes(x_test_2,batch_size=batch_size)
    from sklearn.metrics import accuracy_score,classification_report
    print("\n\nDeep Neural Network - Train accuracy:",round(accuracy_score(y_train,y_train_predclass),3))
    print("\nDeep Neural Network - Test accuracy:",round(accuracy_score(y_test,y_test_predclass),3))
    print("\nDeep Neural Network - Train Classification Report")
    print(classification_report(y_train,y_train_predclass))
    print("\nDeep Neural Network - Test Classification Report")
    print(classification_report(y_test,y_test_predclass))
    

    Output:

    Using TensorFlow backend.
    WARNING:tensorflow:From D:\Python37\Lib\site-packages\tensorflow\python\framework\op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
    Instructions for updating:
    Colocations handled automatically by placer.
    WARNING:tensorflow:From D:\Anaconda3\lib\site-packages\keras\backend\tensorflow_backend.py:3445: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
    Instructions for updating:
    Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #   
    =================================================================
    dense_1 (Dense)              (None, 1000)              10001000  
    _________________________________________________________________
    activation_1 (Activation)    (None, 1000)              0         
    _________________________________________________________________
    dropout_1 (Dropout)          (None, 1000)              0         
    _________________________________________________________________
    dense_2 (Dense)              (None, 500)               500500    
    _________________________________________________________________
    activation_2 (Activation)    (None, 500)               0         
    _________________________________________________________________
    dropout_2 (Dropout)          (None, 500)               0         
    _________________________________________________________________
    dense_3 (Dense)              (None, 50)                25050     
    _________________________________________________________________
    activation_3 (Activation)    (None, 50)                0         
    _________________________________________________________________
    dropout_3 (Dropout)          (None, 50)                0         
    _________________________________________________________________
    dense_4 (Dense)              (None, 20)                1020      
    _________________________________________________________________
    activation_4 (Activation)    (None, 20)                0         
    =================================================================
    Total params: 10,527,570
    Trainable params: 10,527,570
    Non-trainable params: 0
    _________________________________________________________________
    None
    WARNING:tensorflow:From D:\Python37\Lib\site-packages\tensorflow\python\ops\math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
    Instructions for updating:
    Use tf.cast instead.
    Epoch 1/20
    2019-07-06 23:03:46.934966: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
    
       64/11314 [..............................] - ETA: 4:41 - loss: 2.9946
      128/11314 [..............................] - ETA: 2:43 - loss: 2.9948
      192/11314 [..............................] - ETA: 2:03 - loss: 2.9951
      256/11314 [..............................] - ETA: 1:43 - loss: 2.9947
      320/11314 [..............................] - ETA: 1:32 - loss: 2.9938
      (progress output for the remaining batches and epochs omitted)
      
    Deep Neural Network - Train accuracy: 0.999
    Deep Neural Network - Test accuracy: 0.811
    
    Deep Neural Network - Train Classification Report
                  precision    recall  f1-score   support
    
               0       1.00      1.00      1.00       480
               1       1.00      0.99      1.00       584
               2       0.99      1.00      1.00       591
               3       1.00      1.00      1.00       590
               4       1.00      1.00      1.00       578
               5       1.00      1.00      1.00       593
               6       1.00      1.00      1.00       585
               7       1.00      1.00      1.00       594
               8       1.00      1.00      1.00       598
               9       1.00      1.00      1.00       597
              10       1.00      1.00      1.00       600
              11       1.00      1.00      1.00       595
              12       1.00      1.00      1.00       591
              13       1.00      1.00      1.00       594
              14       1.00      1.00      1.00       593
              15       1.00      1.00      1.00       599
              16       1.00      1.00      1.00       546
              17       1.00      1.00      1.00       564
              18       1.00      1.00      1.00       465
              19       1.00      1.00      1.00       377
    
        accuracy                           1.00     11314
       macro avg       1.00      1.00      1.00     11314
    weighted avg       1.00      1.00      1.00     11314
    
    Deep Neural Network - Test Classification Report
                  precision    recall  f1-score   support
    
               0       0.78      0.78      0.78       319
               1       0.70      0.74      0.72       389
               2       0.68      0.69      0.68       394
               3       0.71      0.69      0.70       392
               4       0.82      0.76      0.79       385
               5       0.84      0.74      0.78       395
               6       0.73      0.87      0.80       390
               7       0.85      0.86      0.86       396
               8       0.93      0.91      0.92       398
               9       0.89      0.91      0.90       397
              10       0.96      0.97      0.96       399
              11       0.87      0.95      0.91       396
              12       0.69      0.72      0.70       393
              13       0.88      0.77      0.82       396
              14       0.83      0.92      0.87       394
              15       0.91      0.84      0.88       398
              16       0.78      0.83      0.80       364
              17       0.97      0.87      0.92       376
              18       0.74      0.66      0.70       310
              19       0.59      0.62      0.61       251
    
        accuracy                           0.81      7532
       macro avg       0.81      0.81      0.81      7532
    weighted avg       0.81      0.81      0.81      7532
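
The columns of the two classification reports can be reproduced by hand, which makes the gap between the training and test tables easier to read. The following is a minimal sketch, assuming integer class labels like those in `y_test`. The helper name `per_class_report` and the six toy labels are hypothetical and not part of the original code; the real reports come from scikit-learn's `classification_report`.

```python
from collections import Counter

def per_class_report(y_true, y_pred):
    """Return ({label: (precision, recall, f1, support)}, accuracy)."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)      # true examples per class
    pred_count = Counter(y_pred)   # predictions per class
    tp = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct predictions per class
    report = {}
    for label in labels:
        precision = tp[label] / pred_count[label] if pred_count[label] else 0.0
        recall = tp[label] / support[label] if support[label] else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = (precision, recall, f1, support[label])
    accuracy = sum(tp.values()) / len(y_true)
    return report, accuracy

# Toy labels for three classes (made up for illustration)
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
report, accuracy = per_class_report(y_true, y_pred)
for label, (p, r, f1, n) in report.items():
    print(f"{label:>3}  precision={p:.2f}  recall={r:.2f}  f1={f1:.2f}  support={n}")
print("accuracy:", round(accuracy, 2))
```

Precision is the share of predictions for a class that were correct, recall is the share of true members of a class that were found, and F1 is their harmonic mean. In the test report above, a class with the lowest F1 score (class 19, `talk.religion.misc`) is therefore one that the model both over-predicts and misses relatively often.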
    
  • Original address: https://www.cnblogs.com/peng8098/p/nlp_17.html