  • doc2vec usage notes (2): LabeledSentence in the gensim toolkit

    Comments are welcome. Please credit the source when reposting.

    This post shows how to use the gensim toolkit to learn doc2vec vector representations for documents that carry one or more labels.

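    Use case: a document can sometimes be described not only by its text but also by additional label information. In product recommendation, for example, each product can be treated as a document. Its vector can be learned from the product description alone, but we often want information such as the product category to be taken into account as well. The simplest approach is to learn the product vector from the text and then append one extra dimension that holds the category, but a single dimension represents category information poorly. gensim's doc2vec offers a more reasonable approach: the product labels (such as the category) take part in the training of the product vectors, through gensim's LabeledSentence.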

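    Input file format for LabeledSentence: each line is <labels, words>. There can be several labels, separated by tabs, followed by a tab and the words, which are separated by spaces, e.g. <id  category  I like my cat demon>.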
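    The line format above can be parsed with a few lines of plain Python. This is only a sketch: it assumes, as the examples in this post do, that each line holds exactly two labels (an id and a category) followed by the words, all separated by tabs. The helper name parse_labeled_line and the sample values "B0001" and "pets" are made up for illustration.

```python
def parse_labeled_line(line):
    """Split one '<id> TAB <category> TAB <words>' line into (labels, words)."""
    fields = line.rstrip("\n").split("\t")
    labels = fields[:2]        # assumption: the first two fields are the labels
    words = fields[2].split()  # the words are separated by whitespace
    return labels, words


# Made-up sample line in the format described above.
labels, words = parse_labeled_line("B0001\tpets\tI like my cat demon\n")
print(labels)  # ['B0001', 'pets']
print(words)   # ['I', 'like', 'my', 'cat', 'demon']
```

    Stripping the trailing newline before splitting keeps it from ending up inside the last word.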

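    The output is a vector for every entry in the vocabulary, so the vectors of a product's labels (its id and its category) can be concatenated and used as the vector representation of the product.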
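    The concatenation step might look like the sketch below. The function name concat_vectors and the three-dimensional vectors are made up for illustration; real vectors would come from the trained model and have as many dimensions as the size parameter.

```python
def concat_vectors(id_vector, category_vector):
    """Join the id vector and the category vector into one product vector."""
    return list(id_vector) + list(category_vector)


# Made-up values standing in for two learned label vectors.
id_vec = [0.12, -0.40, 0.33]
cate_vec = [0.05, 0.91, -0.27]

product_vec = concat_vectors(id_vec, cate_vec)
print(len(product_vec))  # 6
```

    The resulting vector has the id dimensions first and the category dimensions after them, so its length is the sum of the two.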

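    I wrote an example, for reference only. Be sure to pass min_count=1 when training: Doc2Vec(sentences, size = 100, window = 5, min_count=1). Otherwise the vocabulary is incomplete, because anything that occurs fewer than min_count times (5 by default) is dropped. This small problem cost me a whole day.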
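    The effect of min_count can be illustrated without gensim. The sketch below is a simplified stand-in for frequency-based vocabulary pruning, not gensim's actual implementation. It assumes, as in Example 1, that labels are counted in the same vocabulary as the words. The function name build_vocab and the sample corpus are made up.

```python
from collections import Counter


def build_vocab(token_lists, min_count):
    """Keep only the tokens that occur at least min_count times."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    return {tok for tok, n in counts.items() if n >= min_count}


# Two made-up documents; each product id occurs exactly once.
corpus = [
    ["B0001", "pets", "I", "like", "my", "cat"],
    ["B0002", "pets", "my", "cat", "likes", "fish"],
]

print("B0001" in build_vocab(corpus, min_count=5))  # False
print("B0001" in build_vocab(corpus, min_count=1))  # True
```

    A product id normally appears on a single line, so any threshold above 1 would remove it from such a vocabulary.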

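    Note: Example 1 below uses the gensim API from before an update. After the update, the labels attribute no longer exists and is replaced by tags, and the target vectors moved from vocab to docvecs. Example 2 shows the usage after the update.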

    Example 1: before the gensim update.

    # -*- coding: UTF-8 -*-
    # Pre-update gensim API: LabeledSentence takes `labels`, and the label
    # vectors are stored in model.vocab together with the word vectors.
    from __future__ import print_function

    import logging

    from gensim.models.doc2vec import Doc2Vec, LabeledSentence

    asin = set()      # product ids seen in the corpus
    category = set()  # category labels seen in the corpus


    class LabeledLineSentence(object):
        """Yield one LabeledSentence per tab-separated line: id, category, words."""

        def __init__(self, filename):
            self.filename = filename

        def __iter__(self):
            with open(self.filename, 'r') as infile:
                for line in infile:
                    fields = line.rstrip('\n').split('\t')
                    asin.add(fields[0])
                    category.add(fields[1])
                    yield LabeledSentence(words=fields[2].split(),
                                          labels=[fields[0], fields[1]])


    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                        level=logging.INFO)
    sentences = LabeledLineSentence('product_bpr_train.txt')
    # min_count=1 keeps labels and words that occur only once.
    model = Doc2Vec(sentences, size=100, window=5, min_count=1)
    model.save('product_bpr_model.txt')

    print(len(model.vocab))
    # Labels share the vocabulary with words, so pick them out via the two sets.
    with open('product_bpr_id_vector.txt', 'w') as outid, \
            open('product_bpr_cate_vector.txt', 'w') as outcate:
        for token in model.vocab:
            row = token + '\t' + ' '.join(str(v) for v in model[token]) + '\n'
            if token in asin:
                outid.write(row)
            if token in category:
                outcate.write(row)

    Example 2: after the gensim update.

    # -*- coding: UTF-8 -*-
    # Post-update gensim API: LabeledSentence takes `tags`, and the tag
    # vectors are stored in model.docvecs instead of model.vocab.
    from __future__ import print_function

    import logging

    from gensim.models.doc2vec import Doc2Vec, LabeledSentence

    asin = set()      # product ids seen in the corpus
    category = set()  # category labels seen in the corpus


    class LabeledLineSentence(object):
        """Yield one LabeledSentence per tab-separated line: id, category, words."""

        def __init__(self, filename):
            self.filename = filename

        def __iter__(self):
            with open(self.filename, 'r') as infile:
                for line in infile:
                    fields = line.rstrip('\n').split('\t')
                    asin.add(fields[0])
                    category.add(fields[1])
                    # split() without arguments also drops the trailing newline
                    yield LabeledSentence(words=fields[2].split(),
                                          tags=[fields[0], fields[1]])


    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                        level=logging.INFO)
    sentences = LabeledLineSentence('product_bpr_test_train.txt')
    model = Doc2Vec(sentences, size=50, window=5, min_count=1)
    model.save('product_bpr_model50.txt')

    print("docvecs length:", len(model.docvecs))
    # Tag vectors are looked up by tag in model.docvecs.
    with open('product_bpr_id_vector50.txt', 'w') as outid:
        for product_id in asin:
            vector = ' '.join(str(v) for v in model.docvecs[product_id])
            outid.write(product_id + '\t' + vector + '\n')
    with open('product_bpr_cate_vector50.txt', 'w') as outcate:
        for cate in category:
            vector = ' '.join(str(v) for v in model.docvecs[cate])
            outcate.write(cate + '\t' + vector + '\n')
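
    The vector files written by the examples hold one entry per line: the label, a tab, then the space-separated vector components. A minimal sketch for loading such a file back into a dict follows. The function name load_vectors is made up, and the demo writes a small temporary file with invented values instead of reading a real model output.

```python
import os
import tempfile


def load_vectors(path):
    """Read 'label TAB v1 v2 ...' lines into a {label: [float, ...]} dict."""
    vectors = {}
    with open(path) as infile:
        for line in infile:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            label, _, values = line.partition("\t")
            vectors[label] = [float(v) for v in values.split()]
    return vectors


# Demo with made-up values, in the same layout the examples write.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("B0001\t0.5 -1.25 2.0 \n")
    tmp.write("pets\t0.0 0.75 -3.5 \n")
    demo_path = tmp.name

vectors = load_vectors(demo_path)
os.remove(demo_path)
print(vectors["B0001"])  # [0.5, -1.25, 2.0]
```

    Because the values are split on whitespace, a trailing space at the end of a line does no harm.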

    References:

    http://rare-technologies.com/doc2vec-tutorial/

    https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-IMDB.ipynb

    http://radimrehurek.com/gensim/models/doc2vec.html#blog

  • Original post: https://www.cnblogs.com/baiting/p/5874994.html