  • Learning Python: Chinese encoding problems with the requests library

    Why the ISO-8859-1 charset shows up

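        requests reads the charset from the Content-Type header of the server's response. It identifies the encoding correctly only when Content-Type carries a charset parameter; otherwise, for text content types, it falls back to the default ISO-8859-1. Poorly configured pages tend to have this problem.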

    requests/utils.py

    def get_encoding_from_headers(headers):
        """Returns encodings from given HTTP Header Dict.
    
        :param headers: dictionary to extract encoding from.
        :rtype: str
        """
    
        content_type = headers.get('content-type')
    
        if not content_type:
            return None
    
        content_type, params = cgi.parse_header(content_type)
    
        if 'charset' in params:
            return params['charset'].strip("'\"")
    
        if 'text' in content_type:
            return 'ISO-8859-1'
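
    The fallback in the function above can be reproduced without requests. The following is a minimal sketch, not the library's code: guess_encoding_from_headers is a hypothetical helper that mirrors the same three branches, and it parses the header with the standard library's email.message.Message instead of cgi.parse_header (an assumption made here so the sketch does not depend on the cgi module).

```python
from email.message import Message


def guess_encoding_from_headers(headers):
    """Hypothetical helper mirroring the header lookup shown above."""
    content_type = headers.get('content-type')
    if not content_type:
        return None

    # Let the standard library split "type/subtype; key=value" for us.
    msg = Message()
    msg['content-type'] = content_type

    charset = msg.get_param('charset')
    if charset:
        return charset.strip("'\"")

    # No charset parameter: text/* falls back to ISO-8859-1.
    if 'text' in msg.get_content_type():
        return 'ISO-8859-1'

    return None


for value in ('text/html; charset=utf-8', 'text/html', 'application/json'):
    print(value, '->', guess_encoding_from_headers({'content-type': value}))
```

    A page served as plain "text/html" therefore ends up with ISO-8859-1 even when its bytes are really UTF-8 or GBK, which is where the garbled Chinese comes from.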

    How to get the correct encoding

         The response object returned by requests has an apparent_encoding property, which detects the text encoding by calling chardet.detect(). Note that this detection costs some CPU time.

       requests/models.py

        @property
        def apparent_encoding(self):
            """The apparent encoding, provided by the chardet library."""
            return chardet.detect(self.content)['encoding']
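
    Detecting an encoding from the bytes alone can be illustrated without chardet. The sketch below is a deliberately crude stand-in, not what chardet does: first_decodable is a hypothetical helper that tries a short, assumed list of candidate encodings in order and returns the first one that decodes the data without error.

```python
def first_decodable(data, candidates=('utf-8', 'gb18030')):
    """Return the first candidate encoding that decodes data cleanly.

    Hypothetical stand-in for statistical detection: strict decoding
    against a fixed candidate list (the list itself is an assumption).
    """
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


utf8_bytes = '中文'.encode('utf-8')
gbk_bytes = '中文'.encode('gbk')
print(first_decodable(utf8_bytes))
print(first_decodable(gbk_bytes))
```

    Even this simple version has to walk the whole body once per candidate, which gives a feel for why detection is more expensive than trusting a declared charset.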
     

    What is the difference between r.text and r.content in requests?

        After requests fetches a resource, the body can be read in two ways: r.text and r.content. What is the difference between them?

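    The requests source shows that r.text returns processed data as a Unicode string, while r.content returns the raw data as bytes. In other words, r.content is cheaper than r.text: r.content hands back the bytes as they are, whereas r.text decodes them to Unicode. If no encoding can be determined from the headers (r.encoding is None), r.text calls chardet to guess the charset, which costs additional CPU time.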

    The requests source below shows how text and content differ.

    # r.text
        @property
        def text(self):
            """Content of the response, in unicode.
    
            If Response.encoding is None, encoding will be guessed using
            ``chardet``.
    
            The encoding of the response content is determined based solely on HTTP
            headers, following RFC 2616 to the letter. If you can take advantage of
            non-HTTP knowledge to make a better guess at the encoding, you should
            set ``r.encoding`` appropriately before accessing this property.
            """
    
            # Try charset from content-type
            content = None
            encoding = self.encoding
    
            if not self.content:
                return str('')
    
            # Fallback to auto-detected encoding.
            if self.encoding is None:
                encoding = self.apparent_encoding
    
            # Decode unicode from given encoding.
            try:
                content = str(self.content, encoding, errors='replace')
            except (LookupError, TypeError):
                # A LookupError is raised if the encoding was not found which could
                # indicate a misspelling or similar mistake.
                #
                # A TypeError can be raised if encoding is None
                #
                # So we try blindly encoding.
                content = str(self.content, errors='replace')
    
            return content
    # r.content
        @property
        def content(self):
            """Content of the response, in bytes."""
    
            if self._content is False:
                # Read the contents.
                if self._content_consumed:
                    raise RuntimeError(
                        'The content for this response was already consumed')
    
                if self.status_code == 0 or self.raw is None:
                    self._content = None
                else:
                    self._content = bytes().join(self.iter_content(CONTENT_CHUNK_SIZE)) or bytes()
    
            self._content_consumed = True
            # don't need to release the connection; that's been handled by urllib3
            # since we exhausted the data.
            return self._content
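
    The decoding step in the text property can be tried on plain bytes. The following is a minimal sketch under stated assumptions: decode_like_text is a hypothetical helper that copies only the decode-and-fall-back part of the property above, and it skips the chardet guess, so a None encoding goes straight to the blind decode.

```python
def decode_like_text(raw, encoding):
    """Hypothetical helper: decode bytes the way the text property does,
    minus the chardet fallback."""
    if not raw:
        return ''
    try:
        # Undecodable bytes become U+FFFD instead of raising.
        return str(raw, encoding, errors='replace')
    except (LookupError, TypeError):
        # LookupError: unknown encoding name. TypeError: encoding is None.
        return str(raw, errors='replace')


raw = '中文'.encode('utf-8')          # what r.content would hold: bytes
print(type(raw).__name__)
print(decode_like_text(raw, 'utf-8'))          # what r.text would hold: str
print(decode_like_text(raw, 'no-such-codec'))  # falls back to a blind decode
```

    Because errors='replace' is used, a wrong encoding never raises an exception here; it silently produces wrong characters, which is why garbled output is easy to miss.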
     

    Fixing garbled Chinese text with requests

    Method 1: set the encoding to UTF-8 directly.

    r.encoding = 'utf-8'   # set this before reading r.text
    text = r.text          # the body is now decoded as UTF-8
    # or decode the raw bytes yourself: r.content.decode('utf-8')
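
    To see why the wrong charset produces garbled text, and why decoding as UTF-8 fixes it, the sketch below works on a hard-coded byte string instead of a live response. The sample string and variable names are assumptions for illustration only.

```python
# Bytes as a UTF-8 server would send them.
raw = '中文'.encode('utf-8')

# Decoding with the ISO-8859-1 fallback never fails, but yields mojibake.
garbled = raw.decode('ISO-8859-1')

# Decoding the same bytes as UTF-8 gives the intended text.
correct = raw.decode('utf-8')

# ISO-8859-1 maps every byte to one character, so already-garbled text
# can be turned back into the original bytes and decoded again.
repaired = garbled.encode('ISO-8859-1').decode('utf-8')

print(garbled == correct)
print(repaired == correct)
```

    This only helps when the page really is UTF-8; for a GBK page the same steps apply with 'gbk' in place of 'utf-8'.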

    Method 2: if the response headers carry no charset, extract it from the meta tag in the HTML.
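
    Method 2 can be sketched as follows. This is one possible approach, not requests' own code: charset_from_meta and decode_html are hypothetical helpers, the regular expression is a simplification that only looks for charset=... inside a meta tag, and the 2048-byte scan limit and the UTF-8 default are assumptions.

```python
import re

# Matches both <meta charset="gbk"> and
# <meta http-equiv="Content-Type" content="text/html; charset=gbk">.
META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)


def charset_from_meta(html_bytes, scan_limit=2048):
    """Return the charset declared in a meta tag, or None if absent."""
    match = META_CHARSET.search(html_bytes[:scan_limit])
    return match.group(1).decode('ascii') if match else None


def decode_html(html_bytes, header_charset=None):
    """Prefer the header charset, then the meta tag, then UTF-8."""
    charset = header_charset or charset_from_meta(html_bytes) or 'utf-8'
    return html_bytes.decode(charset, errors='replace')


page = '<html><head><meta charset="gbk"><title>中文</title></head></html>'.encode('gbk')
print(charset_from_meta(page))
print(decode_html(page))
```

    With a real response, the same idea would be applied by passing r.content as html_bytes, or by assigning the detected name to r.encoding before reading r.text.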

  • Original article: https://www.cnblogs.com/ftl1012/p/9609214.html