  • Fixing "UnicodeDecodeError: 'utf-8' codec can't decode byte ..." when loading a corpus with gensim.models.word2vec.LineSentence

      On Windows, loading the (already tokenized) Chinese Wikipedia corpus with gensim.models.word2vec.LineSentence fails with the following error:

    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xca in position 0: invalid continuation byte
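
      The same kind of failure can be reproduced without gensim. The sketch below assumes the corpus was saved in a legacy Chinese encoding such as GBK (an assumption; the post does not state the file's actual encoding) and shows that a strict UTF-8 decode of such bytes raises UnicodeDecodeError, while decoding with the matching codec succeeds.

```python
# Assumption: the corpus bytes were written in GBK rather than UTF-8.
raw = "数学".encode("gbk")  # GBK-encoded bytes of a short Chinese string

try:
    raw.decode("utf-8")  # a strict UTF-8 decode of non-UTF-8 bytes
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    print(exc)  # message has the same form as the error above

# Decoding with the codec the bytes were written in recovers the text
text = raw.decode("gbk")
print(text)
```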

      Encoding problems like this are a real headache. They arise at a call of the form xxx.decode("utf-8"), so let's look at the relevant gensim source code:

    class LineSentence(object):
        """Iterate over a file that contains sentences: one line = one sentence.
        Words must be already preprocessed and separated by whitespace.
    
        """
        def __init__(self, source, max_sentence_length=MAX_WORDS_IN_BATCH, limit=None):
            """
    
            Parameters
            ----------
            source : string or a file-like object
                Path to the file on disk, or an already-open file object (must support `seek(0)`).
            limit : int or None
                Clip the file to the first `limit` lines. Do no clipping if `limit is None` (the default).
    
            Examples
            --------
            .. sourcecode:: pycon
    
                >>> from gensim.test.utils import datapath
                >>> sentences = LineSentence(datapath('lee_background.cor'))
                >>> for sentence in sentences:
                ...     pass
    
            """
            self.source = source
            self.max_sentence_length = max_sentence_length
            self.limit = limit
    
        def __iter__(self):
            """Iterate through the lines in the source."""
            try:
                # Assume it is a file-like object and try treating it as such
                # Things that don't have seek will trigger an exception
                self.source.seek(0)
                for line in itertools.islice(self.source, self.limit):
                    line = utils.to_unicode(line).split()
                    i = 0
                    while i < len(line):
                        yield line[i: i + self.max_sentence_length]
                        i += self.max_sentence_length
            except AttributeError:
                # If it didn't work like a file, use it as a string filename
                with utils.smart_open(self.source) as fin:
                    for line in itertools.islice(fin, self.limit):
                        line = utils.to_unicode(line).split()
                        i = 0
                        while i < len(line):
                            yield line[i: i + self.max_sentence_length]
                            i += self.max_sentence_length

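      The source shows that the __iter__ method is what makes LineSentence an iterable object, and that all file reading happens inside __iter__. The source argument we pass is usually a file path (that is, a string), so the self.source.seek(0) call in the try block raises an AttributeError because strings have no seek method. The code that actually runs is therefore the except branch.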
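
      The branch selection described above can be checked in isolation. In this minimal sketch, describe_source is a hypothetical helper written only for illustration; it mirrors the try/except pattern and reports which branch a given source would take.

```python
import io


def describe_source(source):
    """Report which branch a LineSentence-style try/except would take."""
    try:
        source.seek(0)  # succeeds for file-like objects
        return "file-like"
    except AttributeError:  # a str has no seek() method
        return "path string"


print(describe_source(io.StringIO("a b c\n")))
print(describe_source("./zhwiki_token.txt"))
```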

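      There are two ways to fix the problem:

      1) from gensim import utils

        utils.smart_open(url, mode="rb", **kw)

        The source opens the file with utils.smart_open(), which defaults to binary mode. Change mode="rb" to mode="r". In text mode the lines arrive as str, which utils.to_unicode returns unchanged, so the failing UTF-8 decode is skipped; the file is then presumably decoded with the platform's default encoding.

      2) from gensim import utils

        utils.to_unicode(text, encoding='utf8', errors='strict')

        The decode("utf8") step in the source defaults to errors="strict". Change it to errors="ignore", that is, utils.to_unicode(line, errors="ignore").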
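
      The effect of the errors argument can be checked with plain bytes.decode, assuming utils.to_unicode passes encoding and errors through to the standard decoder. The input is again assumed to be GBK-encoded, which the post does not confirm. Note that errors="ignore" silences the exception by dropping the bytes it cannot decode, so the resulting text may differ from the original.

```python
# Assumption: a GBK-encoded line containing two whitespace-separated tokens.
raw = "数学 物理".encode("gbk")

# errors="strict" (the default) raises on the first invalid byte
try:
    raw.decode("utf8", errors="strict")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# errors="ignore" never raises, but drops bytes it cannot decode
ignored = raw.decode("utf8", errors="ignore")

# Decoding with the encoding the bytes were written in keeps the text intact
correct = raw.decode("gbk")

print(strict_ok)
print(ignored == correct)
print(correct.split())
```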

      It is better not to edit the library source in place, though. Copy the class into your own code instead, for example:

    import logging
    import itertools
    import gensim
    from gensim.models import word2vec
    from gensim import utils
    
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
    
    class LineSentence(object):
        """Iterate over a file that contains sentences: one line = one sentence.
        Words must be already preprocessed and separated by whitespace.
    
        """
        def __init__(self, source, max_sentence_length=10000, limit=None):
            """
    
            Parameters
            ----------
            source : string or a file-like object
                Path to the file on disk, or an already-open file object (must support `seek(0)`).
            limit : int or None
                Clip the file to the first `limit` lines. Do no clipping if `limit is None` (the default).
    
            Examples
            --------
            .. sourcecode:: pycon
    
                >>> from gensim.test.utils import datapath
                >>> sentences = LineSentence(datapath('lee_background.cor'))
                >>> for sentence in sentences:
                ...     pass
    
            """
            self.source = source
            self.max_sentence_length = max_sentence_length
            self.limit = limit
    
        def __iter__(self):
            """Iterate through the lines in the source."""
            try:
                # Assume it is a file-like object and try treating it as such
                # Things that don't have seek will trigger an exception
                self.source.seek(0)
                for line in itertools.islice(self.source, self.limit):
                    line = utils.to_unicode(line).split()
                    i = 0
                    while i < len(line):
                        yield line[i: i + self.max_sentence_length]
                        i += self.max_sentence_length
            except AttributeError:
                # If it didn't work like a file, use it as a string filename
                with utils.smart_open(self.source, mode="r") as fin:
                    for line in itertools.islice(fin, self.limit):
                        line = utils.to_unicode(line).split()
                        i = 0
                        while i < len(line):
                            yield line[i: i + self.max_sentence_length]
                            i += self.max_sentence_length
    
    our_sentences = LineSentence("./zhwiki_token.txt")
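    # Large corpus: use CBOW and moderately increase the number of iterations.
    # Note: gensim 4.x renames size to vector_size and iter to epochs.
    model = gensim.models.Word2Vec(our_sentences, size=200, iter=30)
    # model.save(save_model_file)
    model.save("./mathWord2Vec" + ".model")   # saved in this form so that training can be continued incrementally later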
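
      Instead of relying on the default encoding of text mode, the file's encoding can also be stated explicitly. The sketch below is a standard-library stand-in, not gensim's class: EncodedLineSentence and its encoding parameter are hypothetical names, and GBK is an assumed encoding for the corpus. It yields lists of tokens in the same chunked way as the copied class above.

```python
import itertools
import os
import tempfile


class EncodedLineSentence:
    """Hypothetical stand-in for LineSentence that takes an explicit encoding."""

    def __init__(self, path, encoding="gbk", max_sentence_length=10000, limit=None):
        self.path = path
        self.encoding = encoding  # assumption: the corpus is GBK-encoded
        self.max_sentence_length = max_sentence_length
        self.limit = limit

    def __iter__(self):
        # Text mode with an explicit codec, so every line arrives as str
        with open(self.path, mode="r", encoding=self.encoding) as fin:
            for line in itertools.islice(fin, self.limit):
                words = line.split()
                # Split over-long lines into chunks of max_sentence_length
                for i in range(0, len(words), self.max_sentence_length):
                    yield words[i:i + self.max_sentence_length]


# Small demonstration on a temporary GBK-encoded file
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_token.txt")
    with open(demo_path, "w", encoding="gbk") as fout:
        fout.write("数学 物理 化学\n自然 语言\n")
    sentences = list(EncodedLineSentence(demo_path, max_sentence_length=2))

print(sentences)
```

      Because the object can be iterated more than once and yields lists of tokens, it could be passed to Word2Vec in place of our_sentences.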
  • Original article: https://www.cnblogs.com/jiangxinyang/p/10411595.html