  • Fixing "UnicodeDecodeError: 'utf-8' codec can't decode byte......" when loading a corpus with gensim.models.word2vec.LineSentence

      On Windows, loading the (already tokenized) Chinese Wikipedia corpus with gensim.models.word2vec.LineSentence fails with the following error:

    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xca in position 0: invalid continuation byte
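
      The message can be reproduced without gensim. The following is a minimal sketch, assuming the corpus was saved as GBK (a common default on Chinese-language Windows; the post itself does not state the file's actual encoding). The GBK bytes for 是 start with 0xca, and that byte pair is not valid UTF-8.

```python
# Assumption: the corpus is GBK-encoded; the post does not say so explicitly.
raw = b"\xca\xc7"  # the character "是" encoded as GBK

try:
    raw.decode("utf-8")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

print(error_message)

# Decoding with the codec the bytes were written in succeeds.
decoded = raw.decode("gbk")
print(decoded)
```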

      Encoding problems like this are a real headache. This kind of error is raised at a call of the form xxx.decode("utf-8"), so let's look at the relevant gensim source:

    class LineSentence(object):
        """Iterate over a file that contains sentences: one line = one sentence.
        Words must be already preprocessed and separated by whitespace.
    
        """
        def __init__(self, source, max_sentence_length=MAX_WORDS_IN_BATCH, limit=None):
            """
    
            Parameters
            ----------
            source : string or a file-like object
                Path to the file on disk, or an already-open file object (must support `seek(0)`).
            limit : int or None
                Clip the file to the first `limit` lines. Do no clipping if `limit is None` (the default).
    
            Examples
            --------
            .. sourcecode:: pycon
    
                >>> from gensim.test.utils import datapath
                >>> sentences = LineSentence(datapath('lee_background.cor'))
                >>> for sentence in sentences:
                ...     pass
    
            """
            self.source = source
            self.max_sentence_length = max_sentence_length
            self.limit = limit
    
        def __iter__(self):
            """Iterate through the lines in the source."""
            try:
                # Assume it is a file-like object and try treating it as such
                # Things that don't have seek will trigger an exception
                self.source.seek(0)
                for line in itertools.islice(self.source, self.limit):
                    line = utils.to_unicode(line).split()
                    i = 0
                    while i < len(line):
                        yield line[i: i + self.max_sentence_length]
                        i += self.max_sentence_length
            except AttributeError:
                # If it didn't work like a file, use it as a string filename
                with utils.smart_open(self.source) as fin:
                    for line in itertools.islice(fin, self.limit):
                        line = utils.to_unicode(line).split()
                        i = 0
                        while i < len(line):
                            yield line[i: i + self.max_sentence_length]
                            i += self.max_sentence_length

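      The source shows that the __iter__ method is what makes LineSentence an iterable object, and that all of the file reading is done inside __iter__. The source argument is usually a file path (that is, a string), so self.source.seek(0) in the try block raises AttributeError because a string has no seek method, and the code that actually runs is the except branch.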
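
      The fallback described above can be seen in isolation. The sketch below uses a hypothetical helper, describe_source, which is not part of gensim; it only mirrors the try/except shape of __iter__.

```python
import io


def describe_source(source):
    """Report which branch a LineSentence-style __iter__ would take."""
    try:
        # File-like objects support seek(); a str path does not.
        source.seek(0)
        return "file-like object"
    except AttributeError:
        # str has no seek attribute, so a path string lands here.
        return "filename"


path_branch = describe_source("./zhwiki_token.txt")
file_branch = describe_source(io.StringIO("one line = one sentence\n"))
print(path_branch, file_branch)
```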

      There are two ways to fix the problem:

      1) from gensim import utils

        utils.smart_open(url, mode="rb", **kw)

        In the source, utils.smart_open() opens the file in binary mode by default; mode="rb" can be changed to mode="r".

      2) from gensim import utils

        utils.to_unicode(text, encoding='utf8', errors='strict')

        In the source, decode("utf8") is called with errors="strict" by default; this can be changed to errors="ignore", i.e. utils.to_unicode(line, errors="ignore").
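
      The effect of the errors argument can be checked on a small byte string before applying it to the whole corpus. This is a rough sketch using only bytes.decode, again assuming GBK input (an assumption, not something the post confirms). Note that errors="ignore" suppresses the exception by dropping the undecodable bytes, so the Chinese token is lost rather than recovered.

```python
# Assumption: the line was written as GBK; "是" is b"\xca\xc7" in that codec.
line = b"abc \xca\xc7 def"

# errors="strict" (the default) raises UnicodeDecodeError.
try:
    line.decode("utf8", errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors="ignore" drops the bytes it cannot decode.
ignored_tokens = line.decode("utf8", errors="ignore").split()

# Decoding with the matching codec keeps every token.
gbk_tokens = line.decode("gbk").split()

print(strict_failed, ignored_tokens, gbk_tokens)
```

      If the tokens matter, it may be worth finding out the file's real encoding rather than ignoring the errors.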

      However, it is better not to modify the library source directly. Instead, copy the class into your own script, for example:

    import logging
    import itertools
    import gensim
    from gensim.models import word2vec
    from gensim import utils
    
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
    
    class LineSentence(object):
        """Iterate over a file that contains sentences: one line = one sentence.
        Words must be already preprocessed and separated by whitespace.
    
        """
        def __init__(self, source, max_sentence_length=10000, limit=None):
            """
    
            Parameters
            ----------
            source : string or a file-like object
                Path to the file on disk, or an already-open file object (must support `seek(0)`).
            limit : int or None
                Clip the file to the first `limit` lines. Do no clipping if `limit is None` (the default).
    
            Examples
            --------
            .. sourcecode:: pycon
    
                >>> from gensim.test.utils import datapath
                >>> sentences = LineSentence(datapath('lee_background.cor'))
                >>> for sentence in sentences:
                ...     pass
    
            """
            self.source = source
            self.max_sentence_length = max_sentence_length
            self.limit = limit
    
        def __iter__(self):
            """Iterate through the lines in the source."""
            try:
                # Assume it is a file-like object and try treating it as such
                # Things that don't have seek will trigger an exception
                self.source.seek(0)
                for line in itertools.islice(self.source, self.limit):
                    line = utils.to_unicode(line).split()
                    i = 0
                    while i < len(line):
                        yield line[i: i + self.max_sentence_length]
                        i += self.max_sentence_length
            except AttributeError:
                # If it didn't work like a file, use it as a string filename.
                # Changed from the library source: mode="r" opens the file in text mode.
                with utils.smart_open(self.source, mode="r") as fin:
                    for line in itertools.islice(fin, self.limit):
                        line = utils.to_unicode(line).split()
                        i = 0
                        while i < len(line):
                            yield line[i: i + self.max_sentence_length]
                            i += self.max_sentence_length
    
    our_sentences = LineSentence("./zhwiki_token.txt")
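    # Large corpus: use CBOW (the default) and raise the number of iterations somewhat.
    # size and iter are the gensim 3.x parameter names; gensim 4 renamed them to vector_size and epochs.
    model = gensim.models.Word2Vec(our_sentences, size=200, iter=30)
    # model.save(save_model_file)
    model.save("./mathWord2Vec" + ".model")   # saving the full model this way allows incremental training later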
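
      An alternative that avoids relying on the platform's default encoding is to pass the encoding explicitly. The class below is a hypothetical, standard-library-only variant (EncodedLineSentence and its encoding parameter are not part of gensim), and the demo writes a small GBK file, which is an assumption about how the corpus might have been saved.

```python
import itertools
import os
import tempfile


class EncodedLineSentence:
    """Hypothetical LineSentence variant that takes an explicit encoding."""

    def __init__(self, path, encoding="utf-8", max_sentence_length=10000, limit=None):
        self.path = path
        self.encoding = encoding
        self.max_sentence_length = max_sentence_length
        self.limit = limit

    def __iter__(self):
        # Text mode with an explicit codec, so no later decode step is needed.
        with open(self.path, "r", encoding=self.encoding) as fin:
            for line in itertools.islice(fin, self.limit):
                words = line.split()
                # Split over-long lines into chunks, as the original class does.
                for i in range(0, len(words), self.max_sentence_length):
                    yield words[i:i + self.max_sentence_length]


# Demo on a temporary GBK-encoded file with two pre-tokenized lines.
with tempfile.NamedTemporaryFile("w", encoding="gbk", suffix=".txt", delete=False) as tmp:
    tmp.write("数学 是 科学\n自然 语言 处理\n")
    demo_path = tmp.name

sentences = list(EncodedLineSentence(demo_path, encoding="gbk"))
os.remove(demo_path)
print(sentences)
```

      Because __iter__ reopens the file on every call, the object can be iterated more than once, which Word2Vec training needs for its multiple passes over the corpus.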
  • Original article: https://www.cnblogs.com/jiangxinyang/p/10411595.html