
    PaddlePaddle Spam Classification in Practice (II)

    Recap

  In the previous article we showed how to classify spam with a support vector machine, reaching an AUC of 73.3%. This article continues the series: we will implement email classification with PaddlePaddle, bringing deep learning to bear on text classification.

    Building the Network Model

  Building a network model with PaddlePaddle is quite straightforward. First, be clear about the format PaddlePaddle expects for its input data; then define the network structure and train it. For input data preprocessing, see my earlier article 【深度学习系列】PaddlePaddle之数据预处理. We start with a shallow neural network.
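
  Concretely (a minimal sketch of my own, not part of the original tutorial): in the PaddlePaddle v2 API the input format is fixed by the paddle.layer.data declarations, and a reader is simply a no-argument function yielding samples that match those declarations:

    import paddle.v2 as paddle

    # Each sample a reader yields must line up, in order and in type, with the
    # network's paddle.layer.data declarations (sizes here are hypothetical).
    x = paddle.layer.data(name='x', type=paddle.data_type.dense_vector(3))
    y = paddle.layer.data(name='y', type=paddle.data_type.integer_value(2))

    def toy_reader():
        # two hypothetical samples: (3-dim feature vector, integer label)
        for features, label in [([0.1, 0.2, 0.3], 0), ([0.9, 0.8, 0.7], 1)]:
            yield features, label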

    Steps

    • Read the data
    • Split into training and validation sets
    • Define the network structure
    • Print training logs
    • Visualize the training results

    Reading the Data

  In PaddlePaddle we need to create a reader to feed in the data. The raw data was already cleaned in the previous article; the positive and negative samples are in ham.txt and spam.txt respectively, so here we only need to load the files.
    Code:

    import numpy as np

    # Load the data
    def loadfile():
        # load the positive samples
        pos = []
        with open('ham.txt', 'r') as fopen:
            for line in fopen:
                pos.append(line)

        # load the negative samples
        neg = []
        with open('spam.txt', 'r') as fopen:
            for line in fopen:
                neg.append(line)

        combined = np.concatenate((pos, neg))
        # create the labels: 1 for ham, 0 for spam
        y = np.concatenate((np.ones(len(pos), dtype=int),
                            np.zeros(len(neg), dtype=int)))
        return combined, y

    # Create the reader PaddlePaddle uses to consume the data
    def reader_creator(dataset, label):
        def reader():
            for i in xrange(len(dataset)):
                yield dataset[i, :], int(label[i])
        return reader
    

    Creating the word index:

    from gensim.corpora.dictionary import Dictionary
    from keras.preprocessing import sequence

    # Build the word dictionary and return each word's index, its word vector,
    # and the word-index sequence for every sentence
    def create_dictionaries(model=None,
                            combined=None):
        if (combined is not None) and (model is not None):
            gensim_dict = Dictionary()
            gensim_dict.doc2bow(model.wv.vocab.keys(),
                                allow_update=True)
            # indices of all words with frequency above 10 (index 0 is reserved)
            w2indx = {v: k+1 for k, v in gensim_dict.items()}
            # word vectors of all words with frequency above 10
            w2vec = {word: model[word] for word in w2indx.keys()}

            def parse_dataset(combined):
                '''Turn words into integer indices.'''
                data = []
                for sentence in combined:
                    new_txt = []
                    for word in sentence.split(' '):
                        try:
                            word = unicode(word, errors='ignore')
                            new_txt.append(w2indx[word])
                        except:
                            new_txt.append(0)
                    data.append(new_txt)
                return data

            combined = parse_dataset(combined)
            # pad each index sequence to maxlen (a global); words with
            # frequency below 10 keep index 0
            combined = sequence.pad_sequences(combined, maxlen=maxlen)
            return w2indx, w2vec, combined
        else:
            print 'No data provided...'
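
  To make the transformation concrete, here is a toy illustration (with a hypothetical vocabulary, not data from this post) of what create_dictionaries produces:

    # Hypothetical example: suppose the Word2Vec vocabulary gives
    #   w2indx = {'free': 1, 'money': 2, 'hello': 3}
    # and maxlen = 5. Then for combined = ['free money', 'hello'],
    # parse_dataset returns [[1, 2], [3]], and pad_sequences left-pads this to
    #   [[0, 0, 0, 1, 2],
    #    [0, 0, 0, 0, 3]]
    # Unknown and low-frequency words all map to index 0.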
    

    Splitting the Training and Validation Sets

  Here we use sklearn's train_test_split function to split the dataset, with a 4:1 ratio of training to validation data.
    Code:

    from gensim.models.word2vec import Word2Vec
    from sklearn.model_selection import train_test_split

    # Load the pre-trained word2vec model
    def word2vec_train(combined):
        model = Word2Vec.load('lstm_data/model/Word2vec_model.pkl')
        index_dict, word_vectors, combined = create_dictionaries(model=model, combined=combined)
        return index_dict, word_vectors, combined

    # Build the training and validation sets
    def get_data(index_dict, word_vectors, combined, y):
        # total number of word indices; words with frequency below 10 share
        # index 0, hence the +1
        n_symbols = len(index_dict) + 1
        # row 0 (the reserved index) stays all-zero; vocab_dim is a global.
        # Note: embedding_weights is prepared for an embedding layer and is not
        # consumed by the shallow network defined below.
        embedding_weights = np.zeros((n_symbols, vocab_dim))
        # starting from index 1, map each word to its word vector
        for word, index in index_dict.items():
            embedding_weights[index, :] = word_vectors[word]
        x_train, x_val, y_train, y_val = train_test_split(combined, y, test_size=0.2)
        print x_train.shape, y_train.shape
        return n_symbols, embedding_weights, x_train, y_train, x_val, y_val
    

    Defining the Network Structure

    import paddle.v2 as paddle

    class NeuralNetwork(object):
        def __init__(self, X_train, Y_train, X_val, Y_val, vocab_dim, n_symbols, num_classes=2):
            # with_gpu is assumed to be defined globally (e.g. with_gpu = False)
            paddle.init(use_gpu=with_gpu, trainer_count=1)

            self.X_train = X_train
            self.Y_train = Y_train
            self.X_val = X_val
            self.Y_val = Y_val
            self.vocab_dim = vocab_dim
            self.n_symbols = n_symbols
            self.num_classes = num_classes

        # Define the network model
        def get_network(self):
            # a shallow classification network
            x = paddle.layer.data(name='x', type=paddle.data_type.dense_vector(self.vocab_dim))
            y = paddle.layer.data(name='y', type=paddle.data_type.integer_value(self.num_classes))
            fc1 = paddle.layer.fc(input=x, size=1280, act=paddle.activation.Linear())
            fc2 = paddle.layer.fc(input=fc1, size=640, act=paddle.activation.Relu())
            prob = paddle.layer.fc(input=fc2, size=self.num_classes, act=paddle.activation.Softmax())
            # mse_cost follows the original post; note that it is
            # paddle.layer.classification_cost that attaches the
            # classification_error_evaluator metric seen in the training log
            cost = paddle.layer.mse_cost(input=prob, label=y)
            return cost
    
        # Define the trainer
        def get_trainer(self):

            cost = self.get_network()

            # create the parameters
            parameters = paddle.parameters.create(cost)

            # candidate optimizers; only the last one (Adam) is actually used
            optimizer0 = paddle.optimizer.Momentum(
                                    momentum=0.9,
                                    regularization=paddle.optimizer.L2Regularization(rate=0.0002 * 128),
                                    learning_rate=0.01 / 128.0,
                                    learning_rate_decay_a=0.01,
                                    learning_rate_decay_b=50000 * 100)

            # manual schedule: learning-rate multipliers of 1.0, 0.1, 0.01
            # up to passes 1, 8 and 13 respectively
            optimizer1 = paddle.optimizer.Momentum(
                                    momentum=0.9,
                                    regularization=paddle.optimizer.L2Regularization(rate=0.0002 * 128),
                                    learning_rate=0.001,
                                    learning_rate_schedule="pass_manual",
                                    learning_rate_args="1:1.0, 8:0.1, 13:0.01")

            optimizer = paddle.optimizer.Adam(
                                    learning_rate=2e-3,
                                    regularization=paddle.optimizer.L2Regularization(rate=8e-4),
                                    model_average=paddle.optimizer.ModelAverage(average_window=0.5))

            # create the trainer
            trainer = paddle.trainer.SGD(
                    cost=cost, parameters=parameters, update_equation=optimizer)
            return parameters, trainer
    
    
        # 开始训练
        def start_trainer(self,X_train,Y_train,X_val,Y_val):
            parameters,trainer = self.get_trainer()
    
            result_lists = []
            def event_handler(event):
                if isinstance(event, paddle.event.EndIteration):
                    if event.batch_id % 100 == 0:
                        print "
    Pass %d, Batch %d, Cost %f, %s" % (
                            event.pass_id, event.batch_id, event.cost, event.metrics)
                if isinstance(event, paddle.event.EndPass):
                        # 保存训练好的参数
                    with open('params_pass_%d.tar' % event.pass_id, 'w') as f:
                        parameters.to_tar(f)
                    # feeding = ['x','y']
                    result = trainer.test(
                            reader=val_reader)
                                # feeding=feeding)
                    print "
    Test with Pass %d, %s" % (event.pass_id, result.metrics)
    
                    result_lists.append((event.pass_id, result.cost,
                            result.metrics['classification_error_evaluator']))
    
            # 开始训练
            train_reader = paddle.batch(paddle.reader.shuffle(
                    reader_creator(X_train,Y_train),buf_size=20),
                    batch_size=4)
    
            val_reader = paddle.batch(paddle.reader.shuffle(
                    reader_creator(X_val,Y_val),buf_size=20),
                    batch_size=4)
    
            trainer.train(reader=train_reader,num_passes=5,event_handler=event_handler)
    
    	#找到训练误差最小的一次结果
    	best = sorted(result_lists, key=lambda list: float(list[1]))[0]
            print 'Best pass is %s, testing Avgcost is %s' % (best[0], best[1])
            print 'The classification accuracy is %.2f%%' % (100 - float(best[2]) * 100)
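
  Once training has written a params_pass_N.tar file, the model can be used for prediction. Below is a minimal inference sketch with the PaddlePaddle v2 API (my addition, not from the original post); it assumes get_network is adjusted to also return the prob softmax layer, and that new_vectors holds feature vectors preprocessed exactly like the training data:

    # Hypothetical inference sketch (assumptions: `prob` is the exposed softmax
    # layer and `new_vectors` is a list of preprocessed feature vectors)
    with open('params_pass_4.tar', 'r') as f:
        trained_parameters = paddle.parameters.Parameters.from_tar(f)

    probs = paddle.infer(output_layer=prob,
                         parameters=trained_parameters,
                         input=[(vec,) for vec in new_vectors])
    labels = probs.argmax(axis=1)  # 1 = ham, 0 = spam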
    

    Training the Model

    # Train the model and save the parameters
    def train():
        print 'Loading Data...'
        combined, y = loadfile()
        print len(combined), len(y)
        print 'Tokenising...'
        # tokenizer (jieba-based word segmentation) is defined elsewhere in the full source
        combined = tokenizer(combined)
        print 'Training a Word2vec model...'
        index_dict, word_vectors, combined = word2vec_train(combined)
        print 'Setting up Arrays for Keras Embedding Layer...'
        n_symbols, embedding_weights, x_train, y_train, x_val, y_val = get_data(index_dict, word_vectors, combined, y)
        print x_train.shape, y_train.shape
        network = NeuralNetwork(X_train=x_train, Y_train=y_train, X_val=x_val, Y_val=y_val,
                                vocab_dim=vocab_dim, n_symbols=n_symbols, num_classes=2)
        network.start_trainer(x_train, y_train, x_val, y_val)

    if __name__ == '__main__':
        train()
    

    Performance

  With training set to 5 passes, the output is as follows:

    Using TensorFlow backend.
    Loading Data...
    63000 63000
    Tokenising...
    Building prefix dict from the default dictionary ...
    [DEBUG 2018-01-29 00:29:19,184 __init__.py:111] Building prefix dict from the default dictionary ...
    Loading model from cache /tmp/jieba.cache
    [DEBUG 2018-01-29 00:29:19,185 __init__.py:131] Loading model from cache /tmp/jieba.cache
    Loading model cost 0.253 seconds.
    [DEBUG 2018-01-29 00:29:19,437 __init__.py:163] Loading model cost 0.253 seconds.
    Prefix dict has been built succesfully.
    [DEBUG 2018-01-29 00:29:19,437 __init__.py:164] Prefix dict has been built succesfully.
    I0128 12:29:17.325337 16772 GradientMachine.cpp:101] Init parameters done.
    Pass 0, Batch 0, Cost 0.519137, {'classification_error_evaluator': 0.25}
    Pass 0, Batch 100, Cost 0.410812, {'classification_error_evaluator': 0}
    Pass 0, Batch 200, Cost 0.486661, {'classification_error_evaluator': 0.25}
    ···
    Pass 4, Batch 12200, Cost 0.508126, {'classification_error_evaluator': 0.25}
    Pass 4, Batch 12300, Cost 0.312028, {'classification_error_evaluator': 0.25}
    Pass 4, Batch 12400, Cost 0.259026, {'classification_error_evaluator': 0.0}
    Pass 4, Batch 12500, Cost 0.177996, {'classification_error_evaluator': 0.25}
    Test with Pass 4, {'classification_error_evaluator': 0.15238096714019775}
    Best pass is 4, testing Avgcost is 0.716855627394
    The classification accuracy is 84.76%
    

  As the log shows, after only 5 passes PaddlePaddle already reaches a classification accuracy of 84.76%; increasing the number of passes would push the accuracy higher.

    Summary

  This article showed how to classify spam email with PaddlePaddle, training a simple shallow neural network that reaches 84.76% accuracy after only 5 passes. In practice you can raise the number of passes to improve the model, or switch to other architectures such as a text CNN or an LSTM for better results; a text CNN sketch follows below.
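  For reference, here is a minimal sketch (my addition, not code from this post) of what a text CNN could look like in the PaddlePaddle v2 API, using paddle.networks.sequence_conv_pool on top of an embedding of the word-index sequences; dict_dim, emb_dim and hid_dim are assumed hyperparameters:

    # Hypothetical text CNN sketch for the PaddlePaddle v2 API.
    # dict_dim = vocabulary size (e.g. n_symbols); emb_dim, hid_dim are assumptions.
    def text_cnn(dict_dim, emb_dim=128, hid_dim=128, num_classes=2):
        data = paddle.layer.data(name='x',
                                 type=paddle.data_type.integer_value_sequence(dict_dim))
        label = paddle.layer.data(name='y',
                                  type=paddle.data_type.integer_value(num_classes))
        emb = paddle.layer.embedding(input=data, size=emb_dim)
        # convolution + max pooling over windows of 3 consecutive words
        conv = paddle.networks.sequence_conv_pool(input=emb,
                                                  context_len=3,
                                                  hidden_size=hid_dim)
        prob = paddle.layer.fc(input=conv,
                               size=num_classes,
                               act=paddle.activation.Softmax())
        return paddle.layer.classification_cost(input=prob, label=label)

  Note that with this network the reader would have to yield the raw word-index sequences (integer_value_sequence) rather than dense vectors.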

  This article was first published on 景略集智 (Jizhi), which also turned it into the video series "PaddlePaddle调戏邮件诈骗犯". If anything is unclear, feel free to ask in the comments~
