  • Text Classification in Natural Language Processing

    I. Preface

    These notes were compiled during my second year of graduate school; I am moving them over from Youdao Cloud Notes today.

    II. Deep learning framework: Keras

    Life is short, I use Keras.

    Official Keras examples: http://keras-cn.readthedocs.io/en/latest/getting_started/sequential_model/
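
    For flavor, here is a minimal Sequential model in the spirit of those examples (a toy binary classifier; the layer sizes here are made up for illustration):

    from keras.models import Sequential
    from keras.layers import Dense
    
    # Toy binary classifier on 20-dimensional inputs
    model = Sequential()
    model.add(Dense(64, activation='relu', input_dim=20))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    model.summary()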

    III. A deep-learning-based text classification pipeline

    1. First, represent each word in the text dataset by its index in the word-embedding matrix. Since a text is a concatenation of words, each text ends up represented as a sequence of word indices.
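
    For intuition, a minimal sketch of this word-to-index mapping (the vocabulary and sentence here are hypothetical):

    # Toy vocabulary mapping words to row indices of the embedding matrix
    word_to_id = {'<PAD>': 0, 'machine': 1, 'learning': 2, 'text': 3, 'classification': 4}
    
    sentence = ['text', 'classification', 'with', 'machine', 'learning']
    # Out-of-vocabulary words ('with') are simply skipped, as in the code below
    seq = [word_to_id[w] for w in sentence if w in word_to_id]
    print(seq)  # [3, 4, 1, 2]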

    dataloader.py: data preprocessing

    import codecs
    import keras as kr  # kr is used below for sequence padding and one-hot labels
    
    def read_category():
        """
        Args:
            None
        Returns:
            categories: a list of label
            cat_to_id: a dict of label to id
    
        """
        ##DBPedia
        # y1 = ['agent', 'device', 'event', 'place', 'species', 'sportsseason', 'topicalconcept', 'unitofwork', 'work']
        # y2 = ['actor', 'amusementparkattraction', 'animal', 'artist', 'athlete', 'bodyofwater', 'boxer', 'britishroyalty', 'broadcaster', 'building', 'cartoon', 'celestialbody', 'cleric', 'clericaladministrativeregion', 'coach', 'comic', 'comicscharacter', 'company', 'database', 'educationalinstitution', 'engine', 'eukaryote', 'fictionalcharacter', 'floweringplant', 'footballleagueseason', 'genre', 'gridironfootballplayer', 'group', 'horse', 'infrastructure', 'legalcase', 'motorcyclerider', 'musicalartist', 'musicalwork', 'naturalevent', 'naturalplace', 'olympics', 'organisation', 'organisationmember', 'periodicalliterature', 'person', 'plant', 'politician', 'presenter', 'race', 'racetrack', 'racingdriver', 'routeoftransportation', 'satellite', 'scientist', 'settlement', 'societalevent', 'software', 'song', 'sportfacility', 'sportsevent', 'sportsleague', 'sportsmanager', 'sportsteam', 'sportsteamseason', 'station', 'stream', 'tournament', 'tower', 'venue', 'volleyballplayer', 'wintersportplayer', 'wrestler', 'writer', 'writtenwork']
        # y3 = ['academicjournal', 'adultactor', 'airline', 'airport', 'album', 'amateurboxer', 'ambassador', 'americanfootballplayer', 'amphibian', 'animangacharacter', 'anime', 'arachnid', 'architect', 'artificialsatellite', 'artistdiscography', 'astronaut', 'australianfootballteam', 'australianrulesfootballplayer', 'automobileengine', 'badmintonplayer', 'band', 'bank', 'baronet', 'baseballleague', 'baseballplayer', 'baseballseason', 'basketballleague', 'basketballplayer', 'basketballteam', 'beachvolleyballplayer', 'beautyqueen', 'biologicaldatabase', 'bird', 'bodybuilder', 'brewery', 'bridge', 'broadcastnetwork', 'buscompany', 'businessperson', 'canadianfootballteam', 'canal', 'canoeist', 'cardinal', 'castle', 'cave', 'chef', 'chessplayer', 'christianbishop', 'classicalmusicartist', 'classicalmusiccomposition', 'collegecoach', 'comedian', 'comicscreator', 'comicstrip', 'congressman', 'conifer', 'convention', 'cricketer', 'cricketground', 'cricketteam', 'crustacean', 'cultivatedvariety', 'curler', 'cycad', 'cyclingrace', 'cyclingteam', 'cyclist', 'dam', 'dartsplayer', 'diocese', 'earthquake', 'economist', 'election', 'engineer', 'entomologist', 'eurovisionsongcontestentry', 'fashiondesigner', 'fern', 'figureskater', 'filmfestival', 'fish', 'footballmatch', 'formulaoneracer', 'fungus', 'gaelicgamesplayer', 'galaxy', 'glacier', 'golfcourse', 'golfplayer', 'golftournament', 'governor', 'grandprix', 'grape', 'greenalga', 'gymnast', 'handballplayer', 'handballteam', 'historian', 'historicbuilding', 'hockeyteam', 'hollywoodcartoon', 'horserace', 'horserider', 'horsetrainer', 'hospital', 'hotel', 'icehockeyleague', 'icehockeyplayer', 'insect', 'jockey', 'journalist', 'judge', 'lacrosseplayer', 'lake', 'lawfirm', 'legislature', 'library', 'lighthouse', 'magazine', 'manga', 'martialartist', 'mayor', 'medician', 'memberofparliament', 'militaryconflict', 'militaryperson', 'militaryunit', 'mixedmartialartsevent', 'model', 'mollusca', 'monarch', 'moss', 'mountain', 'mountainpass', 'mountainrange', 'museum', 'musical', 'musicfestival', 'musicgenre', 'mythologicalfigure', 'nascardriver', 'nationalfootballleagueseason', 'ncaateamseason', 'netballplayer', 'newspaper', 'noble', 'officeholder', 'olympicevent', 'painter', 'philosopher', 'photographer', 'planet', 'play', 'playboyplaymate', 'poem', 'poet', 'pokerplayer', 'politicalparty', 'pope', 'president', 'primeminister', 'prison', 'publictransitsystem', 'publisher', 'racecourse', 'racehorse', 'radiohost', 'radiostation', 'railwayline', 'railwaystation', 'recordlabel', 'religious', 'reptile', 'restaurant', 'river', 'road', 'roadtunnel', 'rollercoaster', 'rower', 'rugbyclub', 'rugbyleague', 'rugbyplayer', 'saint', 'school', 'screenwriter', 'senator', 'shoppingmall', 'single', 'skater', 'skier', 'soapcharacter', 'soccerclubseason', 'soccerleague', 'soccermanager', 'soccerplayer', 'soccertournament', 'solareclipse', 'speedwayrider', 'sportsteammember', 'squashplayer', 'stadium', 'sumowrestler', 'supremecourtoftheunitedstatescase', 'swimmer', 'tabletennisplayer', 'televisionstation', 'tennisplayer', 'tennistournament', 'theatre', 'town', 'tradeunion', 'university', 'videogame', 'village', 'voiceactor', 'volcano', 'winery', 'womenstennisassociationtournament', 'wrestlingevent']
        # y1_to_id=dict(zip(y1,range(len(y1))))
        # y2_to_id = dict(zip(y2, range(len(y2))))
        # y3_to_id = dict(zip(y3, range(len(y3))))
        ##ws
        y1 = ['business', 'communications', 'computer', 'data management', 'digital media', 'other services', 'recreational activities', 'social undertakings', 'traffic']
        # y1 = ['communications', 'data processing', 'digital media', 'economic', 'information technology', 'logistics', 'office', 'organization', 'other services', 'recreational activities', 'social undertakings']
        y2 = ['advertising', 'analytics', 'application development', 'backend', 'banking', 'bitcoin', 'chat', 'cloud', 'data', 'database', 'domains', 'ecommerce', 'education', 'email', 'enterprise', 'entertainment', 'events', 'file sharing', 'financial', 'games', 'government', 'images', 'internet of things', 'mapping', 'marketing', 'media', 'medical', 'messaging', 'music', 'news services', 'other', 'payments', 'photos', 'project management', 'real estate', 'reference', 'science', 'search', 'security', 'shipping', 'social', 'sports', 'stocks', 'storage', 'telephony', 'tools', 'transportation', 'travel', 'video', 'weather']
        y1_to_id=dict(zip(y1,range(len(y1))))
        y2_to_id = dict(zip(y2, range(len(y2))))
        return y1,y1_to_id,y2,y2_to_id
    
    def read_files(filename):
        contents, labels1, labels2 = [], [], []
        i = 0
        with codecs.open(filename, 'r', encoding='utf-8') as f:
            for line in f:
                try:
                    content = line.split(' ')
                    # stopwordslist is a project helper (not shown) that loads a stop-word list from a file
                    stopWords = stopwordslist(
                        r'D:赵鲸朋pycharmModel0905pycharmModel0905PycharmProjectsWos-Metadata2txtoutputstopwordds')
                    # print(stopWords)
                    ###################contents############################
                    # wordsFiltered = []
                    # for w in content:
                    #     if w not in stopWords and not isSymbol(w) and not hasNumbers(w) and len(w) >= 2:
                    #         wordsFiltered.append(w.rstrip('\n').rstrip('\r'))
                    # contents.append(wordsFiltered)
                    #####################label_y1 y2###################################
                    wordsFiltered = []
                    for w in content:
                        if len(w)>=2:
                            wordsFiltered.append(w.rstrip('\n').rstrip('\r'))
                    contents.append(wordsFiltered)
                    #######################################################
                    i=i+1
                except:
                    pass  # skip malformed lines
        print(len(contents))
        return contents
    # filename = r"D:赵鲸朋pycharmModel0905pycharmModel0905PycharmProjectsWos-Metadata2txtdatawoswos_clear_content.txt"
    # cont = read_files(filename)
    # # print(y1[:5])
    # # # print(y2[:5])
    # print(cont[:5])
    ##20200415注释
    # y1_file = r"D:赵鲸朋pycharmModel0905pycharmModel0905PycharmProjectsWos-Metadata2txtdatawoswos_clear_y1.txt"
    # y1 = read_files(y1_file)
    # print(y1)
    
    def process_file(cont_file,y1_file,y2_file, word_to_id,y1_to_id,y2_to_id, max_length=300,y1_length = 2,y2_length = 2):
        """
        Args:
            filename:train_filename or test_filename or val_filename
            word_to_id:get from def read_vocab()
            cat_to_id:get from def read_category()
            max_length:allow max length of sentence 
        Returns:
            x_pad: sequence data from  preprocessing sentence 
            y_pad: sequence data from preprocessing label
    
        """
        contents=read_files(cont_file)
        y1 = read_files(y1_file)
        y2 = read_files(y2_file)
        # y3 = read_files(y3_file)
        data_id,y1_id,y2_id,y3_id=[],[],[],[]
        y1_id_pad,y2_id_pad,y3_id_pad = [],[],[]
        label_y1 = []
        label_y2 = []
        label_y3 = []
        for i in range(len(contents)):
            data_id.append([word_to_id[x] for x in contents[i] if x in word_to_id])
            y1_id_pad.append([word_to_id[x] for x in y1[i] if x in word_to_id])
            y2_id_pad.append([word_to_id[x] for x in y2[i] if x in word_to_id])
            # y3_id_pad.append([word_to_id[x] for x in y3[i] if x in word_to_id])
            ############## Convert y[i]=['computer','science'] into y[i]=['computer science'] #################################
    
            str = ""
            for label in y1[i]:
                str = str+ label + " "
            label_y1.append(str.rstrip(' '))
            # label_id.append(label_idd)
    
            str2 = ""
            for label in y2[i]:
                str2 = str2 + label + " "
            label_y2.append(str2.rstrip(' '))
    
            # str3 = ""
            # for label in y3[i]:
            #     str3 = str3 + label + " "
            # label_y3.append(str3.rstrip(' '))
    
            y1_id.append(y1_to_id[label_y1[i]])
            y2_id.append(y2_to_id[label_y2[i]])
            # y3_id.append(y3_to_id[label_y3[i]])
            ###############################################
            # y1_id.append(y1_to_id[y1[i]])
            # y2_id.append(y2_to_id[y2[i]])
        cont_pad=kr.preprocessing.sequence.pad_sequences(data_id,max_length,padding='post', truncating='post')
        y1_pad = kr.preprocessing.sequence.pad_sequences(y1_id_pad, y1_length, padding='post', truncating='post')
        y2_pad = kr.preprocessing.sequence.pad_sequences(y2_id_pad, y2_length, padding='post', truncating='post')
        # y3_pad = kr.preprocessing.sequence.pad_sequences(y3_id_pad, y3_length, padding='post', truncating='post')
        ##################################
        y1_index = kr.utils.to_categorical(y1_id)
        y2_index = kr.utils.to_categorical(y2_id)
        # y3_index = kr.utils.to_categorical(y3_id)
        #####################################
    
        return cont_pad,y1_index,y2_index,y1_pad,y2_pad

    datasequence.py: the data input/output is handled here, calling into dataloader.py

    ## Convert the texts into vectors
    import time
    import pickle as pl
    from dataloader import read_category, process_file  # assumed layout: the functions above live in dataloader.py
    
    if __name__ == '__main__':
    
        start_time = time.time()
        # config = TextConfig()
        ###############################################################
        pretrained_w2v, word_to_id, _ = pl.load(
            open(r'D:赵鲸朋pycharmModel0905pycharmModel0905PycharmProjectsWos-Metadata2txtdatawosemb_matrix_glove_300', 'rb'))
        cont_file = r"D:赵鲸朋pycharmModel0905pycharmModel0905PycharmProjectsWos-Metadata2txt - webservicedataWSoutput0145ws_clear_content.txt"
        y1_file = r"D:赵鲸朋pycharmModel0905pycharmModel0905PycharmProjectsWos-Metadata2txt - webservicedataWSoutput0145ws_clear_y1.txt"
        y2_file = r"D:赵鲸朋pycharmModel0905pycharmModel0905PycharmProjectsWos-Metadata2txt - webservicedataWSoutput0145ws_clear_y2.txt"
        # y3_file = r"D:赵鲸朋pycharmModel0905pycharmModel0905PycharmProjectsWos-Metadata2txtdataDBPoutput2DBP_clear_y3.txt"
        #########################################################################
        y1, y1_to_id, y2, y2_to_id = read_category()
        # y1, y1_to_id, y2, y2_to_id, y3, y3_to_id= read_category()
        ####################################################################
        seq_length = 100
        cont_pad,y1_index,y2_index,y1_pad,y2_pad= process_file(cont_file,y1_file,y2_file, word_to_id, y1_to_id, y2_to_id, seq_length)
        # x_val, y_val = process_file(file, word_to_id, cat_to_id, config.seq_length)
        print(cont_pad[:3])
        print(y1_index[:3])
        print(y2_index[:3])
        print(y1_pad[:3])
        print(y2_pad[:3])
    
        with open('D:赵鲸朋pycharmModel0905pycharmModel0905PycharmProjectsWos-Metadata2txt - webservicedataWSoutput0145ws_txt_vector300dim_y1y2_2len_100len_zjp0145', 'wb') as f:
            pl.dump((cont_pad,y1_index,y2_index,y1_pad,y2_pad), f)
    
        # trans vector file to numpy file
        # if not os.path.exists(config.vector_word_npz):
        #     export_word2vec_vectors(word_to_id, config.vector_word_filename, config.vector_word_npz)
    
        # with open('./train_val_txt_vector', 'wb') as f:
        #     pl.dump((x_train, x_val, y_train, y_val ), f)
    
        print("Time cost: %.3f seconds...
    " % (time.time() - start_time))

    2. Then feed this index sequence into the deep learning model. In the line below, the indices are mapped directly to the word vectors of the embedding matrix, which then flow through the model during fitting.

    embedding_word_raw = Embedding(self.max_features, self.embedding_dims, weights=[self.word_embedding_matrix],input_length=self.maxlen,name='emb',trainable=False)(input)
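
    A minimal sketch of what this layer does (toy sizes; everything here is made up for illustration): each index in the input sequence selects the corresponding row of the embedding matrix.

    import numpy as np
    from keras.layers import Input, Embedding
    from keras.models import Model
    
    emb_matrix = np.random.rand(10, 4)  # toy: vocabulary of 10 words, 4-dim vectors
    inp = Input((5,))                   # sequences of 5 word indices
    emb = Embedding(10, 4, weights=[emb_matrix], input_length=5, trainable=False)(inp)
    m = Model(inputs=inp, outputs=emb)
    vecs = m.predict(np.array([[3, 4, 1, 2, 0]]))
    print(vecs.shape)  # (1, 5, 4): each index was replaced by its 4-dim word vector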

    Main.py: the model's input and output are handled below

    maxlen = 100
    
    max_features = 89098
    batch_size = 64
    embedding_dims = 300
    epochs = 100
    ####################### GPU setup #################
    """Let GPU memory grow on demand"""
    import os
    import tensorflow as tf
    import keras.backend.tensorflow_backend as KTF
    # Make only the first GPU visible
    os.environ["CUDA_VISIBLE_DEVICES"] = "0"
    config = tf.ConfigProto()
    config.gpu_options.allow_growth=True   # do not grab all GPU memory up front; allocate on demand
    # config.gpu_options.per_process_gpu_memory_fraction = 0.6 # cap each GPU's memory usage at 60%
    sess = tf.Session(config=config)
    KTF.set_session(sess)
    ############################################################################################################################
    import pickle as pl
    import keras
    from sklearn.model_selection import train_test_split
    from keras.callbacks import ModelCheckpoint, EarlyStopping
    from Model import TextONLSTM  # assumed: TextONLSTM is defined in Model.py below
    pretrained_w2v, _, _ = pl.load(open(r'D:1zjpE1106论文实验Web服务分类ServicesRecommenddataemb_matrix_glove_300', 'rb'))
    #############################################################################################################################
    print('Loading data...')
    x,y1,y2,y1_pad,y2_pad =pl.load(open(r'D:1zjpE1106论文实验Web服务分类ServicesRecommenddataws_txt_vector300dim_y1y2_2len_100len_zjp0145','rb'))
    x_train,x_test,y2_train,y2_test = train_test_split( x, y2, test_size=0.2, random_state=42)
    x_train,x_test,y1_train,y1_test = train_test_split( x, y1, test_size=0.2, random_state=42)  # the same random_state keeps the y1 and y2 splits aligned
    ##########################################################################################################################
    
    print('Build model...')
    
    model = TextONLSTM(maxlen, max_features, embedding_dims, pretrained_w2v).get_model()
    ###################### Top-5 accuracy metric ######################################################################################
    def acc_top5(y_true, y_pred):
        return keras.metrics.top_k_categorical_accuracy(y_true, y_pred, k=5)
    ##########################################################################################################################
    model.compile('adam', 'categorical_crossentropy', metrics=['accuracy',acc_top5])
    
    model.summary()
    print('Train...')
    fileweights = r"D:1zjpE1106论文实验Web服务分类ServicesRecommenddataweightsAy1pad_y2_best_weights.h5"
    checkpoint = ModelCheckpoint(fileweights, monitor='val_acc', verbose=1, save_best_only=True, mode='auto')
    early_stopping = EarlyStopping(monitor='val_acc', patience=5, mode='max')
    # Reduce the learning rate when the monitored metric stops improving
    from keras.callbacks import ReduceLROnPlateau
    reduce_lr = ReduceLROnPlateau(monitor='val_loss',factor=0.1, patience=3, mode='auto')
    ########### Train on the hierarchical labels ##########################################################################################
    model.fit(x_train, y1_train,
              # validation_split=0.1,
              batch_size=batch_size,
              epochs=epochs,
              callbacks=[early_stopping, checkpoint,reduce_lr],
              validation_data=(x_test, y1_test),
              shuffle= True)
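
    After training, the best checkpoint can be reloaded and scored on the held-out split; a minimal sketch, assuming the fileweights path and the test split from above:

    # Evaluate with the best weights saved by ModelCheckpoint
    model.load_weights(fileweights)
    loss, acc, top5 = model.evaluate(x_test, y1_test, batch_size=batch_size)
    print('Test accuracy: %.4f, top-5 accuracy: %.4f' % (acc, top5))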


    Model.py: the structure of the deep learning model is defined here

    from keras.layers import Input, Embedding, Dropout, Dense, GlobalMaxPooling1D
    from keras.models import Model
    # ONLSTM is assumed to be a custom Ordered Neurons LSTM layer implemented elsewhere in the project
    
    class TextONLSTM(object):
        def __init__(self, maxlen, max_features, embedding_dims,
                     word_embedding_matrix,
                     class_num=9,
                     last_activation='softmax'):
            self.maxlen = maxlen
            self.max_features = max_features
            self.embedding_dims = embedding_dims
            self.word_embedding_matrix = word_embedding_matrix
            self.class_num = class_num
            self.last_activation = last_activation
    
        def get_model(self):
            input = Input((self.maxlen,))
            ###############################################################
            #################################################################
            embedding_word_raw = Embedding(self.max_features, self.embedding_dims, weights=[self.word_embedding_matrix],input_length=self.maxlen,name='emb',trainable=False)(input)
            # embedding_word_raw = GlobalMaxPooling1D()(embedding_word_raw)
            embedding_word = Dropout(0.25,name='dropout1')(embedding_word_raw)
    
            ##################################################################################
            onlstm = ONLSTM(1024, 2, return_sequences=True, dropconnect=0.25, name="onlstm_1")(embedding_word)
            y0 = GlobalMaxPooling1D(name='pool1')(onlstm)
            ##############################################################################
    
            documentOut = Dense(512, activation="tanh", name="documentOut_1")(y0)
            # x_word = BatchNormalization()(documentOut)
            x_word = Dropout(0.5,name='Dropout2')(documentOut)
    
    
            output = Dense(self.class_num, activation=self.last_activation,name="output_1")(x_word)
            model = Model(inputs=input, outputs=output)
            return model

    IV. Multi-label vs. single-label classification

    The loss function and the final-layer activation must be paired correctly: softmax with categorical_crossentropy for single-label (multi-class) classification, and sigmoid with binary_crossentropy for multi-label classification, as the sketch below shows.
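
    A minimal sketch of the two pairings in Keras (toy input size; the names are illustrative):

    from keras.layers import Input, Dense
    from keras.models import Model
    
    inp = Input((128,))  # toy document feature vector
    num_classes = 9
    
    # Single-label (multi-class): softmax + categorical_crossentropy, classes compete
    single = Model(inp, Dense(num_classes, activation='softmax')(inp))
    single.compile('adam', 'categorical_crossentropy', metrics=['accuracy'])
    
    # Multi-label: sigmoid + binary_crossentropy, each class is decided independently
    multi = Model(inp, Dense(num_classes, activation='sigmoid')(inp))
    multi.compile('adam', 'binary_crossentropy', metrics=['accuracy'])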

    V. A Chinese text classification example

    1. Dataset format: one sample per line, with the label and the text separated by a tab (see the split('\t') in step 5).

    2. First segment the text into words, then build a vocabulary from the segmented words (a minimal sketch follows the example below)

    #labels = ['教育','体育','社会',……]
    #contents = [['我们','北京','上大学','研究生'],['今天','巴萨','战胜','火箭'],['住房','公积金','上涨']……]
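
    A minimal sketch of building such a vocabulary from the segmented contents (Counter-based; reserving index 0 for padding is an assumption here):

    from collections import Counter
    
    contents = [['我们', '北京', '上大学', '研究生'],
                ['今天', '巴萨', '战胜', '火箭'],
                ['住房', '公积金', '上涨']]
    
    counter = Counter(w for doc in contents for w in doc)
    # Reserve index 0 for padding; more frequent words get smaller indices
    words = ['<PAD>'] + [w for w, _ in counter.most_common()]
    word_to_id = dict(zip(words, range(len(words))))
    print(word_to_id['我们'])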

    3. Serialize each sentence by the words' indices in the vocabulary (i.e., represent it as a sequence of word indices)

    def process_file(filename,word_to_id,cat_to_id,max_length=600):
        """
        Args:
            filename:train_filename or test_filename or val_filename
            word_to_id:get from def read_vocab()
            cat_to_id:get from def read_category()
            max_length:allow max length of sentence 
        Returns:
            x_pad: sequence data from  preprocessing sentence 
            y_pad: sequence data from preprocessing label
        """
        labels,contents=read_file(filename)
        data_id,label_id=[],[]
        for i in range(len(contents)):
            data_id.append([word_to_id[x] for x in contents[i] if x in word_to_id])
            label_id.append(cat_to_id[labels[i]])
        x_pad=kr.preprocessing.sequence.pad_sequences(data_id,max_length,padding='post', truncating='post')
        y_pad=kr.utils.to_categorical(label_id)  # y_pad: single-label handling; this differs from the multi-label case
        return x_pad,y_pad
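
    For intuition, a tiny runnable example of the padding and one-hot steps used above (toy data):

    import keras as kr
    
    data_id = [[3, 4, 1, 2], [5, 6]]   # two serialized sentences
    label_id = [0, 2]                  # their category ids
    
    x_pad = kr.preprocessing.sequence.pad_sequences(data_id, 6, padding='post', truncating='post')
    y_pad = kr.utils.to_categorical(label_id)
    print(x_pad)  # [[3 4 1 2 0 0], [5 6 0 0 0 0]]
    print(y_pad)  # [[1. 0. 0.], [0. 0. 1.]]
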
    5. Train word vectors (word2vec) on the dataset (this step is independent of the first four steps)
    import re
    import time
    import codecs
    import logging
    import jieba
    from gensim.models import word2vec
    
    re_han = re.compile(u"([\u4E00-\u9FD5a-zA-Z]+)")  # the method of cutting text by punctuation
    
    class Get_Sentences(object):
        '''
        Args:
            filenames: a list of train_filename, test_filename, val_filename
        Yields:
            word: a list of words cut by jieba
        '''
    
        def __init__(self,filenames):
            self.filenames= filenames
    
        def __iter__(self):
            for filename in self.filenames:
                with codecs.open(filename, 'r', encoding='utf-8') as f:
                    for _,line in enumerate(f):
                        try:
                            line=line.strip()
                            line=line.split('\t')
                            assert len(line)==2
                            blocks=re_han.split(line[1])
                            word=[]
                            for blk in blocks:
                                if re_han.match(blk):
                                    word.extend(jieba.lcut(blk))
                            yield word
                        except:
                            pass
    
    def train_word2vec(filenames):
        '''
        Use word2vec to train word vectors.
        Args:
            filenames: a list of train_filename, test_filename, val_filename
        Returns:
            None; saves the word vectors to config.vector_word_filename
        '''
        t1 = time.time()
        sentences = Get_Sentences(filenames)
        logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
        model = word2vec.Word2Vec(sentences, size=100, window=5, min_count=1, workers=6)
        model.wv.save_word2vec_format(config.vector_word_filename, binary=False)
        print('-------------------------------------------')
        print("Training word2vec model cost %.3f seconds...
    " % (time.time() - t1))
    
    if __name__ == '__main__':
        config=TextConfig()  # TextConfig: project configuration class holding file paths (definition not shown)
        filenames=[config.train_filename,config.test_filename,config.val_filename]
        train_word2vec(filenames)

    6. Feed the text sequences, together with the word-embedding matrix, into the neural network for training; a sketch of assembling that matrix follows.
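
    A minimal sketch of how the trained vectors might be assembled into that matrix (assuming gensim's text format saved by train_word2vec above; word_to_id and the file path stand in for the project's own):

    import numpy as np
    from gensim.models import KeyedVectors
    
    # Load the vectors saved by train_word2vec (text format, binary=False)
    w2v = KeyedVectors.load_word2vec_format('vector_word.txt', binary=False)  # i.e. config.vector_word_filename
    
    embedding_dims = 100  # must match size=100 used during training
    embedding_matrix = np.zeros((len(word_to_id), embedding_dims))
    for word, idx in word_to_id.items():
        if word in w2v:
            embedding_matrix[idx] = w2v[word]  # row idx holds the vector for this word
    # embedding_matrix can now be passed as Embedding(..., weights=[embedding_matrix])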

    VI. A great research site for text classification code and papers

    The site ranks the performance of various deep learning models for text classification.

    Paper reading notes:
    1. Notes on "CNN for Sentence Classification" (TextCNN)
    https://zhuanlan.zhihu.com/p/59988106
    2. Recurrent Convolutional Neural Networks for Text Classification
    https://zhuanlan.zhihu.com/p/21253220
    3. Notes on the HAN model
    https://zhuanlan.zhihu.com/p/26892711
    4. Universal Language Model Fine-tuning for Text Classification (ACL 2018)
    https://zhuanlan.zhihu.com/p/47344283
    5. Text classification based on joint embedding of words and labels (ACL 2018)
    https://zhuanlan.zhihu.com/p/54734708
    6. Some thoughts on adversarial training in NLP
    7. Notes on "Multi-label Text Classification Based on Semantic Units"
    8. A survey of graph embedding:
    "Graph Embedding Techniques, Applications, and Performance: A Survey"
    https://zhuanlan.zhihu.com/p/62629465
    9. Paper notes: Structural Deep Network Embedding
    https://blog.csdn.net/jianbinzheng/article/details/83545754
    10. Graph embedding: DeepWalk, LINE, Node2Vec, SDNE, Struc2Vec
    https://zhuanlan.zhihu.com/p/56733145
    11. RMDL: Random Multimodel Deep Learning for classification
    https://blog.csdn.net/oYeZhou/article/details/89096099
    https://zhuanlan.zhihu.com/p/68748697
