
Fixing "UnicodeDecodeError: 'utf-8' codec can't decode byte ..." when loading a corpus with gensim.models.word2vec.LineSentence

While loading a pre-tokenized Chinese Wikipedia corpus with gensim.models.word2vec.LineSentence on Windows, I got the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xca in position 0: invalid continuation byte

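Encoding problems like this are a real headache. This error is raised when bytes are decoded as UTF-8, that is, at a call such as xxx.decode("utf-8"), so let's look at the relevant gensim source: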

class LineSentence(object):
    """Iterate over a file that contains sentences: one line = one sentence.
    Words must be already preprocessed and separated by whitespace.

    """
    def __init__(self, source, max_sentence_length=MAX_WORDS_IN_BATCH, limit=None):
        """

        Parameters
        ----------
        source : string or a file-like object
            Path to the file on disk, or an already-open file object (must support `seek(0)`).
        limit : int or None
            Clip the file to the first `limit` lines. Do no clipping if `limit is None` (the default).

        Examples
        --------
        .. sourcecode:: pycon

            >>> from gensim.test.utils import datapath
            >>> sentences = LineSentence(datapath('lee_background.cor'))
            >>> for sentence in sentences:
            ...     pass

        """
        self.source = source
        self.max_sentence_length = max_sentence_length
        self.limit = limit

    def __iter__(self):
        """Iterate through the lines in the source."""
        try:
            # Assume it is a file-like object and try treating it as such
            # Things that don't have seek will trigger an exception
            self.source.seek(0)
            for line in itertools.islice(self.source, self.limit):
                line = utils.to_unicode(line).split()
                i = 0
                while i < len(line):
                    yield line[i: i + self.max_sentence_length]
                    i += self.max_sentence_length
        except AttributeError:
            # If it didn't work like a file, use it as a string filename
            with utils.smart_open(self.source) as fin:
                for line in itertools.islice(fin, self.limit):
                    line = utils.to_unicode(line).split()
                    i = 0
                    while i < len(line):
                        yield line[i: i + self.max_sentence_length]
                        i += self.max_sentence_length

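The source shows that the __iter__ method is what makes LineSentence an iterable object, and that all of the file reading is defined inside __iter__. The source argument we pass in is usually a file path, that is, a string. In that case self.source.seek(0) in the try branch raises an AttributeError (a str has no seek method), so the code that actually runs is the except branch.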
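The fallback described above can be seen in isolation with a minimal sketch. The path is only a placeholder taken from the script later in this post, and no file is opened.

```python
# A str path has no seek() method, so the call made in the try branch
# raises AttributeError and control moves to the except branch.
source = "./zhwiki_token.txt"  # placeholder path; nothing is opened here

try:
    source.seek(0)
    branch = "file-like object"
except AttributeError:
    branch = "string filename"

print(branch)
```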

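There are two ways to fix the problem:

1) from gensim import utils

   utils.smart_open(url, mode="rb", **kw)

   In the source, utils.smart_open() opens the file in binary mode by default. Change mode="rb" to mode="r".

2) from gensim import utils

   utils.to_unicode(text, encoding='utf8', errors='strict')

   In the source, decode("utf8") is called with the default errors="strict". Change it to errors="ignore", that is, utils.to_unicode(line, errors="ignore"). Note that errors="ignore" silently drops any bytes that cannot be decoded.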
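The effect of the errors argument can be sketched with plain bytes.decode, without gensim. The article does not say which encoding the corpus file actually uses. The example below assumes, purely for illustration, that it was saved as GBK, a common default on Chinese-locale Windows systems. The sample word is made up.

```python
text = "数据"
raw = text.encode("gbk")  # assumption: the corpus bytes are GBK, not UTF-8

# Strict UTF-8 decoding (the default in to_unicode) rejects these bytes.
try:
    raw.decode("utf-8", errors="strict")
    strict_failed = False
except UnicodeDecodeError as exc:
    strict_failed = True
    print(exc)

# errors="ignore" suppresses the exception by dropping undecodable bytes,
# so the result no longer matches the original text.
lossy = raw.decode("utf-8", errors="ignore")
print(repr(lossy), lossy == text)

# Decoding with the encoding the bytes were written in recovers the text.
recovered = raw.decode("gbk")
print(recovered == text)
```

If the file really is GBK, this may explain why opening it in text mode avoids the error on such a system, since text mode typically decodes with the locale's default encoding there. In that case errors="ignore" would only hide the error while losing text.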

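That said, it is better not to edit the installed gensim source directly. Instead, copy the class into your own script and change it there. The copy below applies fix 1: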

import logging
import itertools
import gensim
from gensim.models import word2vec
from gensim import utils

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

class LineSentence(object):
    """Iterate over a file that contains sentences: one line = one sentence.
    Words must be already preprocessed and separated by whitespace.

    """
    def __init__(self, source, max_sentence_length=10000, limit=None):
        """

        Parameters
        ----------
        source : string or a file-like object
            Path to the file on disk, or an already-open file object (must support `seek(0)`).
        limit : int or None
            Clip the file to the first `limit` lines. Do no clipping if `limit is None` (the default).

        Examples
        --------
        .. sourcecode:: pycon

            >>> from gensim.test.utils import datapath
            >>> sentences = LineSentence(datapath('lee_background.cor'))
            >>> for sentence in sentences:
            ...     pass

        """
        self.source = source
        self.max_sentence_length = max_sentence_length
        self.limit = limit

    def __iter__(self):
        """Iterate through the lines in the source."""
        try:
            # Assume it is a file-like object and try treating it as such
            # Things that don't have seek will trigger an exception
            self.source.seek(0)
            for line in itertools.islice(self.source, self.limit):
                line = utils.to_unicode(line).split()
                i = 0
                while i < len(line):
                    yield line[i: i + self.max_sentence_length]
                    i += self.max_sentence_length
        except AttributeError:
            # If it didn't work like a file, use it as a string filename
            with utils.smart_open(self.source, mode="r") as fin:
                for line in itertools.islice(fin, self.limit):
                    line = utils.to_unicode(line).split()
                    i = 0
                    while i < len(line):
                        yield line[i: i + self.max_sentence_length]
                        i += self.max_sentence_length

our_sentences = LineSentence("./zhwiki_token.txt")
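# Large corpus: use CBOW (the default) and raise the number of iterations somewhat.
# Note: gensim 4.x renames size to vector_size and iter to epochs.
model = gensim.models.Word2Vec(our_sentences, size=200, iter=30)
# model.save(save_model_file)
model.save("./mathWord2Vec" + ".model")   # saving in this format allows incremental training to continue later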
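As a variation on the copied class, the encoding can be made explicit instead of relying on the platform default. The following is a minimal standard-library sketch, not gensim's implementation. The class name EncodedLineSentence, its encoding and errors parameters, and the demo file contents are all hypothetical, and the GBK encoding is again an assumption.

```python
import itertools
import os
import tempfile


class EncodedLineSentence:
    """Hypothetical variant of LineSentence that takes an explicit encoding."""

    def __init__(self, path, encoding="utf-8", errors="strict",
                 max_sentence_length=10000, limit=None):
        self.path = path
        self.encoding = encoding
        self.errors = errors
        self.max_sentence_length = max_sentence_length
        self.limit = limit

    def __iter__(self):
        # Text mode with an explicit encoding, so decoding does not depend
        # on the platform default.
        with open(self.path, mode="r", encoding=self.encoding,
                  errors=self.errors) as fin:
            for line in itertools.islice(fin, self.limit):
                words = line.split()
                # Split long lines into chunks of at most max_sentence_length words.
                for i in range(0, len(words), self.max_sentence_length):
                    yield words[i: i + self.max_sentence_length]


# Demo on a tiny GBK-encoded file (made-up content, not the wiki corpus).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_gbk.txt")
    with open(demo_path, "w", encoding="gbk") as fout:
        fout.write("数学 是 科学\n自然 语言 处理\n")
    sentences = list(EncodedLineSentence(demo_path, encoding="gbk"))

print(sentences)
```

Because the object can be iterated more than once, an instance could be passed to Word2Vec in the same way as our_sentences above.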


Original article: https://www.cnblogs.com/jiangxinyang/p/10411595.html