Chinese Sentiment Recognition 2

IMDB Movie Review Sentiment Analysis

Notes based on:

• Deep Learning: Python Practice Based on Keras (深度学习:基于 Keras 的 Python 实践) / 魏贞原. Beijing: Publishing House of Electronics Industry, May 2018.

Problem Description

Here we use movie reviews to classify a film's reception as positive or negative. The dataset is provided by IMDB (http://www.imdb.com/interfaces/) and contains 25,000 labeled reviews for training plus another 25,000 for testing.

Importing and Exploring the Data

Key points:

• imdb.load_data() downloads the raw dataset; from it we inspect the class labels, the vocabulary size, and the distribution of review lengths.

Code

    """ 
    情感分析实例:IMDB 影评情感分析 
    """
    # %%
    from keras.datasets import imdb
    import numpy as np
    from matplotlib import pyplot as plt
    # %% 导入数据
    (x_train, y_train), (x_test, y_test) = imdb.load_data()
    # 合并训练数据集和评估数据集
    x = np.concatenate((x_train, x_test), axis=0)
    y = np.concatenate((y_train, y_test), axis=0)
    
    print('x shape is %s, y shape is %s' % (x.shape, y.shape))
    print('Classes: %s' % np.unique(y))
    print('Total words: %s' % len(np.unique(np.hstack(x))))
    
    result = [len(word) for word in x]
    print('Mean: %.2f words (STD: %.2f)' %(np.mean(result), np.std(result)))
    # 图表展示
    plt.subplot(121)
    plt.boxplot(result)
    plt.subplot(122)
    plt.hist(result)
    plt.show()

Results

[Output: class labels, vocabulary size, mean/std of review length]

[Figure: sentence-length distribution (boxplot and histogram)]

Word Embedding

Word embeddings, introduced in Bengio's paper A Neural Probabilistic Language Model, are a technique for vectorizing words and one of the recent breakthroughs in natural language processing. The idea is that words are encoded as real-valued vectors in a high-dimensional space, where similarity in meaning corresponds to closeness in the vector space: discrete words are mapped to vectors of continuous numbers.

Keras turns positive-integer word indices into word embeddings through its Embedding layer. The layer must be given the vocabulary size (the maximum word index expected) and the dimensionality of the vector output for each word.

To process the IMDB dataset with word embeddings, suppose we only care about the 5,000 most frequent words in the dataset; the vocabulary size is therefore 5,000. We choose 32-dimensional vectors to represent each word, which is what the embedding layer outputs. Reviews are also limited to 500 words: longer reviews are truncated, and shorter ones are padded with zeros.
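As a quick illustration (a minimal sketch, not from the book), the following feeds a toy batch of random word indices through an embedding layer configured this way and prints the output shape:

# Each of the 500 word indices in a review is mapped to a 32-dimensional
# vector, so one review becomes a 500 x 32 matrix.
import numpy as np
from keras.models import Sequential
from keras.layers.embeddings import Embedding

model = Sequential()
model.add(Embedding(5000, 32, input_length=500))
# A toy batch of 2 "reviews", each 500 random word indices in [0, 5000)
batch = np.random.randint(0, 5000, size=(2, 500))
print(model.predict(batch).shape)  # (2, 500, 32)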

Key points

• imdb.load_data(num_words=5000)
  num_words: integer or None. Top most frequent words to consider; any less frequent word will appear as the oov_char value in the sequence data. In other words, only the 5,000 most frequent words are kept for analysis.

• sequence.pad_sequences
  Sequence padding: reviews vary in length; padding/truncating gives every review the same length (see the sketch after this list).
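A minimal sketch of this behavior on toy data (not from the book): by default, pad_sequences left-pads short sequences with 0 and truncates long ones from the front.

from keras.preprocessing import sequence

reviews = [[1, 2, 3], [1, 2, 3, 4, 5, 6, 7]]
print(sequence.pad_sequences(reviews, maxlen=5))
# [[0 0 1 2 3]
#  [3 4 5 6 7]]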

Code

    """
    情感分析实例:IMDB 影评情感分析
    """
    from keras.datasets import imdb
    from keras.preprocessing import sequence
    from keras.layers.embeddings import Embedding
    (x_train, y_train), (x_validation, y_validation) = imdb.load_data(num_words=5000)
    
    x_train = sequence.pad_sequences(x_train, maxlen=500)
    x_validation = sequence.pad_sequences(x_validation, maxlen=500)
    
    # 构建嵌入层
    Embedding(5000, 32, input_length=500)
    

    MLP Model

Key points

• model.summary()
  Prints a summary representation of the model.
• During training you may see the warning "Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory." It typically comes from the Embedding layer's sparse gradients being converted to dense tensors, and is harmless at this model size.

Code

    """
    情感分析实例:IMDB 影评情感分析
    """
    import os
    os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
    from keras.datasets import imdb
    import numpy as np
    from keras.preprocessing import sequence
    from keras.layers.embeddings import Embedding
    from keras.layers import Dense, Flatten
    from keras.models import Sequential
    
    seed =7
    top_words = 5000
    max_words = 500
    out_dimension = 32
    batch_size = 128
    epochs = 2
    
    def create_model():
        model = Sequential()
        #构建嵌入层
        model.add(Embedding(
            top_words, out_dimension, input_length=max_words
            )
        )
        model.add(
            Flatten()
        )
        model.add(Dense(250, activation='relu'))
        model.add(Dense(1,activation='sigmoid'))
        model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
        model.summary()
        return model
    np.random.seed(seed)
    # 导入数据
    (x_train, y_train), (x_validation, y_validation) = imdb.load_data(num_words=top_words)
    # 限定数据集的长度
    x_train = sequence.pad_sequences(x_train, maxlen=max_words)
    x_validation = sequence.pad_sequences(x_validation, maxlen=max_words)
    model = create_model()
    model.fit(x_train,y_train,validation_data=(x_validation, y_validation), batch_size = batch_size, epochs=epochs, verbose=2)
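The listing stops at fit; as a small follow-up sketch (not part of the book's code), the trained model can be scored on the validation set:

scores = model.evaluate(x_validation, y_validation, verbose=0)
# scores is [loss, accuracy] because of metrics=['accuracy']
print('Accuracy: %.2f%%' % (scores[1] * 100))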

    CNN


Code

    """
    情感分析实例:IMDB 影评情感分析
    CNN
    """
    import os
    os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
    from keras.datasets import imdb
    import numpy as np
    from keras.preprocessing import sequence
    from keras.layers.embeddings import Embedding
    from keras.layers.convolutional import Conv1D, MaxPooling1D
    from keras.layers import Dense, Flatten
    from keras.models import Sequential
    
    seed = 7
    top_words = 5000
    max_words = 500
    out_dimension = 32
    batch_size = 128
    epochs = 2
    
    def create_model():
        model = Sequential()
        model.add(
            Embedding(top_words, output_dim=out_dimension, input_length=max_words)
        )
        model.add(
            Conv1D(filters=32, kernel_size=3, padding='same', activation='relu')
        )
        model.add(MaxPooling1D(pool_size=2))
        model.add(Flatten())
        model.add(Dense(250, activation='relu'))
        model.add(Dense(1, activation='sigmoid'))
        model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
        model.summary()
        return model    
    
    np.random.seed(seed=seed)
    (x_train, y_train), (x_validation, y_validation) = imdb.load_data(num_words=top_words)
    x_train = sequence.pad_sequences(x_train, maxlen=max_words)
    x_validation = sequence.pad_sequences(x_validation, maxlen=max_words)
    
    # 生成模型
    model = create_model()
    model.fit(x_train, y_train, validation_data=(x_validation, y_validation), batch_size=batch_size,epochs=epochs, verbose=2)
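As an extra, hedged sketch (not in the book): scoring a hand-written review requires encoding its words the same way imdb.load_data does, i.e. the ranks from imdb.get_word_index() shifted by index_from=3, with 1 as the start marker.

# Encode a new review with the same index scheme as the training data
word_index = imdb.get_word_index()
review = 'this movie was terrible and boring'
encoded = [1] + [word_index[w] + 3 for w in review.lower().split()
                 if w in word_index and word_index[w] + 3 < top_words]
x = sequence.pad_sequences([encoded], maxlen=max_words)
print('Positive probability: %.3f' % model.predict(x)[0][0])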

Results

Model structure

    Model: "sequential_3"
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #   
    =================================================================
    embedding_3 (Embedding)      (None, 500, 32)           160000    
    _________________________________________________________________
    conv1d_1 (Conv1D)            (None, 500, 32)           3104      
    _________________________________________________________________
    max_pooling1d_1 (MaxPooling1 (None, 250, 32)           0         
    _________________________________________________________________
    flatten_2 (Flatten)          (None, 8000)              0         
    _________________________________________________________________
    dense_3 (Dense)              (None, 250)               2000250   
    _________________________________________________________________
    dense_4 (Dense)              (None, 1)                 251       
    =================================================================
    Total params: 2,163,605
    Trainable params: 2,163,605
    Non-trainable params: 0
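
The parameter counts follow directly from the layer shapes: the embedding stores 5000 × 32 = 160,000 weights; each of the 32 Conv1D filters spans 3 positions × 32 input channels plus a bias, giving (3 × 32 + 1) × 32 = 3,104; the first Dense layer connects the 250 × 32 = 8,000 flattened inputs to 250 units, 8,000 × 250 + 250 = 2,000,250; and the output layer adds 250 + 1 = 251.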

Training results

    Train on 25000 samples, validate on 25000 samples
    Epoch 1/2
     - 34s - loss: 0.4359 - accuracy: 0.7736 - val_loss: 0.2756 - val_accuracy: 0.8862
    Epoch 2/2
     - 33s - loss: 0.2091 - accuracy: 0.9178 - val_loss: 0.3009 - val_accuracy: 0.8742