  • Some pitfalls of the Python Requests library when handling responses

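    Python's Requests library (http://docs.python-requests.org/en/latest/) makes handling HTTP/HTTPS requests convenient and is widely used.
    There are, however, a few things to watch out for when handling a response. In short, it comes down to the difference between the content and text properties of the Response object. The relevant source code is as follows: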

        @property
        def content(self):
            """Content of the response, in bytes."""
    
            if self._content is False:
                # Read the contents.
                try:
                    if self._content_consumed:
                        raise RuntimeError(
                            'The content for this response was already consumed')
    
                    if self.status_code == 0:
                        self._content = None
                    else:
                        self._content = bytes().join(self.iter_content(CONTENT_CHUNK_SIZE)) or bytes()
    
                except AttributeError:
                    self._content = None
    
            self._content_consumed = True
            # don't need to release the connection; that's been handled by urllib3
            # since we exhausted the data.
            return self._content
    
        @property
        def text(self):
            """Content of the response, in unicode.
    
            if Response.encoding is None and chardet module is available, encoding
            will be guessed.
            """
    
            # Try charset from content-type
            content = None
            encoding = self.encoding
    
            if not self.content:
                return str('')
    
            # Fallback to auto-detected encoding.
            if self.encoding is None:
                encoding = self.apparent_encoding
    
            # Decode unicode from given encoding.
            try:
                content = str(self.content, encoding, errors='replace')
            except (LookupError, TypeError):
                # A LookupError is raised if the encoding was not found which could
                # indicate a misspelling or similar mistake.
                #
                # A TypeError can be raised if encoding is None
                #
                # So we try blindly encoding.
                content = str(self.content, errors='replace')
    
            return content
    
        @property
        def apparent_encoding(self):
            """The apparent encoding, provided by the lovely Charade library
            (Thanks, Ian!)."""
            return chardet.detect(self.content)['encoding']

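    As the code shows, the text property decodes the raw bytes into a string (using the response's encoding, or a guessed one when encoding is None), while content returns the bytes untouched.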
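
    The fallback order inside text can be reproduced with the standard library alone. The following is only a minimal sketch, not the library's code: decode_like_text and fake_guess are hypothetical names, and fake_guess stands in for apparent_encoding (chardet in the real library). It shows that the guess is made only when no encoding is given.

```python
from typing import Callable, Optional


def decode_like_text(raw: bytes,
                     encoding: Optional[str],
                     guess: Callable[[bytes], Optional[str]]) -> str:
    """Mimic the fallback order of the `text` property quoted above.

    `guess` is a hypothetical stand-in for `apparent_encoding`.
    """
    if not raw:
        return ""
    if encoding is None:
        # Only in this branch is the (potentially slow) guess performed.
        encoding = guess(raw)
    try:
        return str(raw, encoding, errors="replace")
    except (LookupError, TypeError):
        # LookupError: unknown codec name. TypeError: the guess returned None.
        return str(raw, errors="replace")


calls = []


def fake_guess(raw: bytes) -> str:
    """Hypothetical detector that records each call and answers 'utf-8'."""
    calls.append(len(raw))
    return "utf-8"


body = "中文".encode("utf-8")
declared = decode_like_text(body, "utf-8", fake_guess)  # guess is not called
guessed = decode_like_text(body, None, fake_guess)      # guess is called once
```

    After running it, calls holds a single entry: the detector ran only for the call that passed None as the encoding.
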
    The encoding attribute of the response is assigned in the build_response method of HTTPAdapter in adapters.py. The code is as follows:

        def build_response(self, req, resp):
            """Builds a :class:`Response <requests.Response>` object from a urllib3
            response. This should not be called from user code, and is only exposed
            for use when subclassing the
            :class:`HTTPAdapter <requests.adapters.HTTPAdapter>`
    
            :param req: The :class:`PreparedRequest <PreparedRequest>` used to generate the response.
            :param resp: The urllib3 response object.
            """
            response = Response()
    
            # Fallback to None if there's no status_code, for whatever reason.
            response.status_code = getattr(resp, 'status', None)
    
            # Make headers case-insensitive.
            response.headers = CaseInsensitiveDict(getattr(resp, 'headers', {}))
    
            # Set encoding.
            response.encoding = get_encoding_from_headers(response.headers)
            response.raw = resp
            response.reason = response.raw.reason
    
            if isinstance(req.url, bytes):
                response.url = req.url.decode('utf-8')
            else:
                response.url = req.url
    
            # Add new cookies from the server.
            extract_cookies_to_jar(response.cookies, req, resp)
    
            # Give the Response some context.
            response.request = req
            response.connection = self
    
            return response

    The line response.encoding = get_encoding_from_headers(response.headers) above shows that the encoding is obtained by parsing the headers:

    def get_encoding_from_headers(headers):
        """Returns encodings from given HTTP Header Dict.
    
        :param headers: dictionary to extract encoding from.
        """
    
        content_type = headers.get('content-type')
    
        if not content_type:
            return None
    
        content_type, params = cgi.parse_header(content_type)
    
        if 'charset' in params:
            return params['charset'].strip("'\"")
    
        if 'text' in content_type:
            return 'ISO-8859-1'
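
    The effect of that last rule is easy to miss: a text response with no charset is treated as ISO-8859-1, which mangles a UTF-8 body. The sketch below imitates the header rule with the standard library's email.message.Message instead of cgi.parse_header (an assumption made so the example also runs on Python versions without the cgi module). The function name encoding_from_content_type is hypothetical.

```python
from email.message import Message
from typing import Optional


def encoding_from_content_type(content_type: Optional[str]) -> Optional[str]:
    """Approximate the charset rule of get_encoding_from_headers."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_param("charset")
    if isinstance(charset, str) and charset:
        return charset.strip("'\"")
    if "text" in msg.get_content_type():
        # No charset declared on a text type: fall back to ISO-8859-1.
        return "ISO-8859-1"
    return None


body = "新浪".encode("utf-8")                    # what the server actually sent
enc = encoding_from_content_type("text/html")    # header declares no charset
mangled = body.decode(enc, errors="replace")     # decoded with the fallback
correct = body.decode("utf-8")                   # decoded with the real charset
```

    Here mangled is six Latin-1 characters rather than the two Chinese characters in correct, because ISO-8859-1 maps every byte to one character and never raises an error.
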

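    Requests guesses the encoding with chardet only when encoding is None, and that guess can be slow on a large body. To avoid it, use the text property with caution: read content directly and decode it yourself as needed.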
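
    One way to decode content yourself is to try a short list of encodings you expect from the site instead of running a detector. This is only a sketch under assumptions: decode_content is a hypothetical helper, and the candidate list ("utf-8", "gb18030") is an assumption that suits Chinese sites and should be adapted to your data.

```python
from typing import Optional, Sequence


def decode_content(raw: bytes,
                   declared: Optional[str] = None,
                   candidates: Sequence[str] = ("utf-8", "gb18030")) -> str:
    """Decode bytes with a declared charset or a few known candidates.

    No statistical detection is done; the first codec that decodes the
    whole body without error wins.
    """
    to_try = ([declared] if declared else []) + list(candidates)
    for name in to_try:
        try:
            return raw.decode(name)
        except (LookupError, UnicodeDecodeError):
            # Unknown codec name or bytes invalid for this codec: try the next.
            continue
    # Last resort: keep what can be decoded and mark the rest.
    return raw.decode("utf-8", errors="replace")


from_gbk = decode_content("中文".encode("gbk"))
from_utf8 = decode_content("中文".encode("utf-8"))
```

    With a Requests response r this would be called as decode_content(r.content). Pass declared only when the server really sent a charset, since a fallback value such as ISO-8859-1 decodes any byte sequence without error and would hide the problem.
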
    For a response whose server does not explicitly declare a charset, the difference between using text and content is shown below:
    Code:

        import time

        import requests

        print(time.time())
        print('begin request')
        r = requests.get(r'http://www.sina.com.cn')
        # erase the response encoding so that text has to guess it
        r.encoding = None
        r.text
        # r.content
        print('request end')
        print(time.time())

    Elapsed time when using text: (screenshot not preserved)

    Elapsed time when using content: (screenshot not preserved)
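
    To measure the two accesses separately from the network round trip, each property can be timed on a response that has already been received. The sketch below is an illustration only: timed, compare_access_cost and FakeResponse are hypothetical names, and FakeResponse simulates the guessing cost with a short sleep so the example runs without network access or chardet.

```python
import time


def timed(fn):
    """Call fn() and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def compare_access_cost(response):
    """Return (seconds for .content, seconds for .text) on one response."""
    response.encoding = None  # same "erase" step as in the snippet above
    _, content_seconds = timed(lambda: response.content)
    _, text_seconds = timed(lambda: response.text)
    return content_seconds, text_seconds


class FakeResponse:
    """Hypothetical stand-in for a received response."""
    encoding = "ISO-8859-1"
    content = b"<html></html>"

    @property
    def text(self):
        time.sleep(0.01)  # simulates the cost of guessing the encoding
        return self.content.decode("utf-8")


content_seconds, text_seconds = compare_access_cost(FakeResponse())
```

    With a real response, compare_access_cost(requests.get(url)) would report the cost of the two properties alone, since the body has already been downloaded by the time get returns.
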




  • Original article: https://www.cnblogs.com/Jerryshome/p/3272748.html