  • Detecting negative movie reviews with an LSTM: naive Bayes only reaches about 51% accuracy, while the LSTM can reach 99%

    Basic approach:

    Take the first 200 words of each review. Build a vocabulary and encode each review as a sequence of vocabulary indices (i.e., simply numbering the first 200 words of each review), then train an LSTM to classify reviews as positive or negative.

    A walkthrough of the code is given in the comments. The embedding layer is essentially word2vec: it performs a dimensionality reduction of the input. Not every LSTM needs it; for example, an LSTM used for MNIST image classification has no such layer (a minimal sketch of that case follows, before the full listing).
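
    For contrast, here is a minimal sketch of that MNIST case (my own illustration, assuming tflearn and its bundled MNIST loader; not part of the original post): each 28x28 image is fed to the LSTM as a sequence of 28 real-valued pixel rows, so no embedding layer is involved.

    # Hypothetical contrast example: LSTM on MNIST without an embedding layer.
    import tflearn
    import tflearn.datasets.mnist as mnist

    X, Y, testX, testY = mnist.load_data(one_hot=True)
    X = X.reshape([-1, 28, 28])          # (samples, timesteps, features)
    testX = testX.reshape([-1, 28, 28])

    net = tflearn.input_data(shape=[None, 28, 28])   # real-valued rows, no embedding
    net = tflearn.lstm(net, 128, dropout=0.8)
    net = tflearn.fully_connected(net, 10, activation='softmax')
    net = tflearn.regression(net, optimizer='adam', learning_rate=0.001,
                             loss='categorical_crossentropy')

    model = tflearn.DNN(net, tensorboard_verbose=0)
    model.fit(X, Y, validation_set=(testX, testY), show_metric=True, batch_size=32)

    The full listing for the movie-review task: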

    import tensorflow as tf
    from tensorflow.contrib.learn.python import learn
    from sklearn import metrics
    from sklearn.model_selection import train_test_split
    import numpy as np
    from sklearn.naive_bayes import GaussianNB
    import os
    import tflearn
    from tflearn.data_utils import to_categorical, pad_sequences
    
    
    MAX_DOCUMENT_LENGTH = 200
    EMBEDDING_SIZE = 50
    
    n_words=0
    
    
    def load_one_file(filename):
        x=""
        with open(filename) as f:
            for line in f:
                x+=line
        return x
    
    def load_files(rootdir, label):
        files = os.listdir(rootdir)
        x = []
        y = []
        for name in files:
            path = os.path.join(rootdir, name)
            if os.path.isfile(path):
                print("Load file %s" % path)
                y.append(label)
                x.append(load_one_file(path))
        return x, y
    
    
    def load_data():
        x=[]
        y=[]
        x1,y1=load_files("../data/movie-review-data/review_polarity/txt_sentoken/pos/",0)
        x2,y2=load_files("../data/movie-review-data/review_polarity/txt_sentoken/neg/", 1)
        x=x1+x2
        y=y1+y2
        return x,y 
    
    
    
    def do_rnn(trainX, testX, trainY, testY):
        global n_words
        # Data preprocessing
        # Sequence padding
        print "GET n_words embedding %d" % n_words
    
    
        trainX = pad_sequences(trainX, maxlen=MAX_DOCUMENT_LENGTH, value=0.)
        testX = pad_sequences(testX, maxlen=MAX_DOCUMENT_LENGTH, value=0.)
        # Converting labels to binary vectors
        trainY = to_categorical(trainY, nb_classes=2)
        testY = to_categorical(testY, nb_classes=2)
    
        print(trainX[:10])
        print(testX[:10])

        # Network building
        net = tflearn.input_data([None, MAX_DOCUMENT_LENGTH])
        net = tflearn.embedding(net, input_dim=n_words, output_dim=128)
        net = tflearn.lstm(net, 128, dropout=0.8)
        net = tflearn.fully_connected(net, 2, activation='softmax')
        net = tflearn.regression(net, optimizer='adam', learning_rate=0.001,
                                 loss='categorical_crossentropy')

        # Training
        model = tflearn.DNN(net, tensorboard_verbose=3)
        model.fit(trainX, trainY, validation_set=(testX, testY),
                  show_metric=True, batch_size=32, run_id="maidou")


    def do_NB(x_train, x_test, y_train, y_test):
        gnb = GaussianNB()
        y_predict = gnb.fit(x_train, y_train).predict(x_test)
        score = metrics.accuracy_score(y_test, y_predict)
        print('NB Accuracy: {0:f}'.format(score))


    def main(unused_argv):
        global n_words
        x, y = load_data()
        x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4, random_state=0)

        vp = learn.preprocessing.VocabularyProcessor(
            max_document_length=MAX_DOCUMENT_LENGTH, min_frequency=1)
        vp.fit(x)
        x_train = np.array(list(vp.transform(x_train)))
        x_test = np.array(list(vp.transform(x_test)))
        n_words = len(vp.vocabulary_)
        print('Total words: %d' % n_words)

        do_NB(x_train, x_test, y_train, y_test)
        do_rnn(x_train, x_test, y_train, y_test)


    if __name__ == '__main__':
        tf.app.run()

    Example negative review:

    plot : two teen couples go to a church party , drink and then drive .  
    they get into an accident .    
    one of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . 
    what's the deal ? 
    watch the movie and " sorta " find out . . . 
    critique : a mind-fuck movie for the teen generation that touches on a very cool idea , but presents it in a very bad package . 
    which is what makes this review an even harder one to write , since i generally applaud films which attempt to break the mold , mess with your head and such ( lost highway & memento ) , but there are good and bad ways of making all types of films , and these folks just didn't snag this one correctly . 
    they seem to have taken this pretty neat concept , but executed it terribly . 
    so what are the problems with the movie ? 
    well , its main problem is that it's simply too jumbled .  
    it starts off " normal " but then downshifts into this " fantasy " world in which you , as an audience member , have no idea what's going on .  
    there are dreams , there are characters coming back from the dead , there are others who look like the dead , there are strange apparitions , there are disappearances , there are a looooot of chase scenes , there are tons of weird things that happen , and most of it is simply not explained .          
    now i personally don't mind trying to unravel a film every now and then , but when all it does is give me the same clue over and over again , i get kind of fed up after a while , which is this film's biggest problem . 
    it's obviously got this big secret to hide , but it seems to want to hide it completely until its final five minutes .  
    and do they make things entertaining , thrilling or even engaging , in the meantime ? 
    not really . 
    the sad part is that the arrow and i both dig on flicks like this , so we actually figured most of it out by the half-way point , so all of the strangeness after that did start to make a little bit of sense , but it still didn't the make the film all that more entertaining . 
    i guess the bottom line with movies like this is that you should always make sure that the audience is " into it " even before they are given the secret password to enter your world of understanding . 
    i mean , showing melissa sagemiller running away from visions for about 20 minutes throughout the movie is just plain lazy ! ! 
    okay , we get it . . . there   
    are people chasing her and we don't know who they are . 
    do we really need to see it over and over again ? 
    how about giving us different scenes offering further insight into all of the strangeness going down in the movie ? 
    apparently , the studio took this film away from its director and chopped it up themselves , and it shows .  
    there might've been a pretty decent teen mind-fuck movie in here somewhere , but i guess " the suits " decided that turning it into a music video with little edge , would make more sense .  
    the actors are pretty good for the most part , although wes bentley just seemed to be playing the exact same character that he did in american beauty , only in a new neighborhood .  
    but my biggest kudos go out to sagemiller , who holds her own throughout the entire film , and actually has you feeling her character's unraveling . 
    overall , the film doesn't stick because it doesn't entertain , it's confusing , it rarely excites and it feels pretty redundant for most of its runtime , despite a pretty cool ending and explanation to all of the craziness that came before it . 
    oh , and by the way , this is not a horror or teen slasher flick . . . it's
    just packaged to look that way because someone is apparently assuming that the genre is still hot with the kids .
    it also wrapped production two years ago and has been sitting on the shelves ever since . 
    whatever . . . skip 
    it !        

    Example positive review:

    films adapted from comic books have had plenty of success , whether they're about superheroes ( batman , superman , spawn ) , or geared toward kids ( casper ) or the arthouse crowd ( ghost world ) , but there's never really been a comic book like from hell before . 
    for starters , it was created by alan moore ( and eddie campbell ) , who brought the medium to a whole new level in the mid '80s with a 12-part series called the watchmen .
    to say moore and campbell thoroughly researched the subject of jack the ripper would be like saying michael jackson is starting to look a little odd .  
    the book ( or " graphic novel , " if you will ) is over 500 pages long and includes nearly 30 more that consist of nothing but footnotes .  
    in other words , don't dismiss this film because of its source .  
    if you can get past the whole comic book thing , you might find another stumbling block in from hell's directors , albert and allen hughes .
    getting the hughes brothers to direct this seems almost as ludicrous as casting carrot top in , well , anything , but riddle me this : who better to direct a film that's set in the ghetto and features really violent street crime than the mad geniuses behind menace ii society ? 
    the ghetto in question is , of course , whitechapel in 1888 london's east end .
    it's a filthy , sooty place where the whores ( called " unfortunates " ) are starting to get a little nervous about this mysterious psychopath who has been carving through their profession with surgical precision .
    when the first stiff turns up , copper peter godley ( robbie coltrane , the world is not enough ) calls in inspector frederick abberline ( johnny depp , blow ) to crack the case . 
    abberline , a widower , has prophetic dreams he unsuccessfully tries to quell with copious amounts of absinthe and opium . 
    upon arriving in whitechapel , he befriends an unfortunate named mary kelly ( heather graham , say it isn't so ) and proceeds to investigate the horribly gruesome crimes that even the police surgeon can't stomach .
    i don't think anyone needs to be briefed on jack the ripper , so i won't go into the particulars here , other than to say moore and campbell have a unique and interesting theory about both the identity of the killer and the reasons he chooses to slay .
    in the comic , they don't bother cloaking the identity of the ripper , but screenwriters terry hayes ( vertical limit ) and rafael yglesias ( les mis ? rables ) do a good job of keeping him hidden from viewers until the very end . 
    it's funny to watch the locals blindly point the finger of blame at jews and indians because , after all , an englishman could never be capable of committing such ghastly acts .
    and from hell's ending had me whistling the stonecutters song from the simpsons for days ( " who holds back the electric car/who made steve guttenberg a star ? " ) . 
    don't worry - it'll all make sense when you see it . 
    now onto from hell's appearance : it's certainly dark and bleak enough , and it's surprising to see how much more it looks like a tim burton film than planet of the apes did ( at times , it seems like sleepy hollow 2 ) .
    the print i saw wasn't completely finished ( both color and music had not been finalized , so no comments about marilyn manson ) , but cinematographer peter deming ( don't say a word ) ably captures the dreariness of victorian-era london and helped make the flashy killing scenes remind me of the crazy flashbacks in twin peaks , even though the violence in the film pales in comparison to that in the black-and-white comic . 
    oscar winner martin childs' ( shakespeare in love ) production design turns the original prague surroundings into one creepy place . 
    even the acting in from hell is solid , with the dreamy depp turning in a typically strong performance and deftly handling a british accent . 
    ians holm ( joe gould's secret ) and richardson ( 102 dalmatians ) log in great supporting roles , but the big surprise here is graham . 
    i cringed the first time she opened her mouth , imagining her attempt at an irish accent , but it actually wasn't half bad . 
    the film , however , is all good . 
    2 : 00 - r for strong violence/gore , sexuality , language and drug content 

    Sample data after padding and converting the labels to categorical (one-hot) vectors:

    Padded and categorical data:
    trainX (3 samples): [[
    1299 6 1 1596 26 354 155 1 62 101 537 252 5 22 2048 516 4 140 252 119 19 1 147 226 16 56 19 2 435 37 2 77 648 15 1 164 222 22 389 12 93 39 19392 235 16 189 83 1299 6 2 1426 453 7 976 1375 97 1 67 6 2928 42 2489 58 1 251 225 3 36 4 133 120 305 138 8 4730 244 1274 70 1018 4 49 14539 2290 947 3881 22772 1594 1296 67 1812 11 2663 9397 7513 3133 2 1619 8232 307 16958 4 2015 1329 1 16813 3571 4 869 3376 5 1019 41 7 518 33 598 7 1 1600 4 15406 1473 29 2 77 199 812 15956 21 33 1841 315 1852 371 5280 27 468 2663 343 2 334 11397 1619 5 1562 47 19 0 3 4239 11 100 10 234 219 10 0 0 8 30 4 220 144 1 414 4 3226 11120 3161 92 299 366 725 1010 27520 5 3343 76 7 1 1205 12 12549 1121 4 44 1 2195 9938 6 23 0 12 2663 6858 5 1425 19 2 1378] [ 1361 1 1647 4 1 4974 130 26 11041 1126 130 1232 1 57 26 7 269 641 5 205 3325 1053 3 5152 6318 622 2 5999 4 911 223 14 3772 5166 15739 6635 2036 633 1 2146 778 2697 327 9589 8311 3 3031 19 36 1 4974 8164 28 1 3103 4276 6344 27 618 2 4266 5 1 4203 1427 1199 1083 7 150 192 1 2294 3 15520 185 52 6 2 3689 572 4 6431 15520 6635 6 130 1232 2 5020 778 503 12 36 2805 4 1538 9333 4795 1518 4 25 405 1539 17927 6489 1427 6646 34 17491 13 3501 99 1232 5309 17 90 2 4074 1232 32 68 13660 162 5 2 7412 258 83 4 405 460 11 8238 12857 18618 3890 922 3915 3 146 32 5 488 10 2125 9736 5 2 2217 16298 3915 81 2529 48 1232 996 4 54 1053 522 18 157 9 410 24 25 4 23045 348 24 1535 35 1689 1 5410 1232 23995 3 4 220 9 340 41 1053 6 1391 18618 9608 16865 1232 24 272 6 681 7 0 100 20 109 642] [ 83 59 25 11208 9 371 3442 7 2 546 181 29 176 158 13 546 25133 3 13 1554 4819 20 25356 12 36 46 5311 1 1075 4 3442 169 31 134 5 75 11 1 98 44 104 22 6759 12 2 13377 235 1397 4 1 1948 826 26697 371 3442 1605 13 260 1364 12771 4462 7 2 429 1340 29 1 164 63 1142 7782 4587 1599 6 7 1 1758 1217 12 2 541 8661 1142 168 10363 541 3 9 588 33 5 826 37 1 546 4553 4 36 140 300 93 97 361 168 2 8661 28 2 1988 508 3 102 6 2524 5 7651 100 516 1180 20 4837 11 13791 5 1 67 8 115 245 529 391 109 2 821 78 578 198 715 5 103 1218 95 5 1 415 662 337 415 1605 337 1415 3 10 2 571 6 2311 2812 3 10809 3442 3 1599 245 2349 5 4402 87 4339 3 18 1422 6642 12 2 11316 8790 5 819 46 116 266 193 1599 32 1585 5 141 85 7 546 487 144 18 1537 3442 124 41 4 13]]
    trainY (3 samples):
    [[ 1. 0.] [ 0. 1.] [ 0. 1.]]

    Here MAX_DOCUMENT_LENGTH = 200, so every document is clipped: any text beyond the first 200 tokens is simply truncated and ignored. This is because:

    tf.contrib.learn.preprocessing.VocabularyProcessor(max_document_length, min_frequency=0, vocabulary=None, tokenizer_fn=None)

    Parameters:

    max_document_length: maximum document length. Documents longer than this are truncated; shorter ones are padded with 0.
    min_frequency: minimum word frequency; words that occur fewer times than this are not added to the vocabulary.
    vocabulary: a CategoricalVocabulary object.
    tokenizer_fn: tokenizer function.

    Code:

    from tensorflow.contrib import learn
    import numpy as np
    max_document_length = 4
    x_text =[
        'i love you',
        'me too'
    ]
    vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length)
    vocab_processor.fit(x_text)
    print(next(vocab_processor.transform(['i me too'])).tolist())
    x = np.array(list(vocab_processor.fit_transform(x_text)))
    print(x)
    
    [1, 4, 5, 0]
    [[1 2 3 0]
     [4 5 0 0]]

    Documentation: http://tflearn.org/data_utils/
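
    As a sanity check of the two tflearn helpers used above (pad_sequences and to_categorical), here is a minimal sketch on toy index sequences. This is my own example; the expected outputs in the comments assume tflearn's default post-padding/post-truncation and may differ across versions.

    from tflearn.data_utils import pad_sequences, to_categorical

    seqs = [[1, 2, 3], [4, 5, 6, 7, 8, 9]]
    padded = pad_sequences(seqs, maxlen=4, value=0.)
    print(padded)
    # expected (default post-padding / post-truncation):
    # [[1 2 3 0]
    #  [4 5 6 7]]

    labels = to_categorical([0, 1, 1], nb_classes=2)
    print(labels)
    # expected:
    # [[ 1.  0.]
    #  [ 0.  1.]
    #  [ 0.  1.]]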

  • Original post: https://www.cnblogs.com/bonelee/p/7903934.html