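  • Adding gzip compression and decompression support when fetching web pages with urllib2.urlopen in Python

    I had already implemented fetching the content of a web page in Python. The existing code is: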

    import urllib
    import urllib2

    #------------------------------------------------------------------------------
    # get response from url
    # note: if you have already used cookiejar, then here will automatically use it
    # while using urllib2.Request
    # gConst is assumed to be a global dict defined elsewhere in the script
    def getUrlResponse(url, postDict={}, headerDict={}):
        # make sure url is a str, not unicode, otherwise urllib2.urlopen will raise an error
        url = str(url)

        if postDict:
            # passing data to urllib2.Request turns the request into a POST
            postData = urllib.urlencode(postDict)
            req = urllib2.Request(url, postData)
            req.add_header('Content-Type', 'application/x-www-form-urlencoded')
        else:
            req = urllib2.Request(url)

        if headerDict:
            print "added header:", headerDict
            for key in headerDict.keys():
                req.add_header(key, headerDict[key])

        req.add_header('User-Agent', gConst['userAgentIE9'])
        req.add_header('Cache-Control', 'no-cache')
        req.add_header('Accept', '*/*')
        #req.add_header('Accept-Encoding', 'gzip, deflate')
        req.add_header('Connection', 'Keep-Alive')
        resp = urllib2.urlopen(req)

        return resp

    #------------------------------------------------------------------------------
    # get response html==body from url
    def getUrlRespHtml(url, postDict={}, headerDict={}):
        resp = getUrlResponse(url, postDict, headerDict)
        respHtml = resp.read()
        return respHtml

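    This code does not support compressed HTML: it neither asks for compression nor decompresses the response.

    I now want to add support for compression and decompression.

    I had already implemented the same feature in C# and understood the logic behind it, so the task here is only to work out how to do it in Python; the underlying mechanism is already familiar.

    【Solution process】

    1. I had briefly looked through some related posts before, but did not have time to solve the problem then.

    I now know that the first step is to add a gzip header to the HTTP request. The Python code is:

    req.add_header('Accept-Encoding', 'gzip, deflate')

    If the server then chooses to compress the response, the data returned by read() on the HTTP response is the gzip-compressed data.

    The next step is to figure out how to decompress it.

    2. I first looked up gzip, and found that the official Python documentation describes it like this: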

    12.2. gzip — Support for gzip files

    This module provides a simple interface to compress and decompress files just like the GNU programs gzip and gunzip would.

    The data compression is provided by the zlib module.

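    That is, the gzip module is aimed at compressing and decompressing files, while compression and decompression of the data itself is done by zlib.

    So I went on to look at zlib: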

    zlib.decompress(string[, wbits[, bufsize]])

    Decompresses the data in string, returning a string containing the uncompressed data. The wbits parameter controls the size of the window buffer, and is discussed further below. If bufsize is given, it is used as the initial size of the output buffer. Raises the error exception if any error occurs.

    The absolute value of wbits is the base two logarithm of the size of the history buffer (the “window size”) used when compressing data. Its absolute value should be between 8 and 15 for the most recent versions of the zlib library, larger values resulting in better compression at the expense of greater memory usage. When decompressing a stream, wbits must not be smaller than the size originally used to compress the stream; using a too-small value will result in an exception. The default value is therefore the highest value, 15. When wbits is negative, the standard gzip header is suppressed.

    bufsize is the initial size of the buffer used to hold decompressed data. If more space is required, the buffer size will be increased as needed, so you don’t have to get this value exactly right; tuning it will only save a few calls to malloc(). The default size is 16384.

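    I then called zlib.decompress directly in the program, which raised an error. I resolved it later; the details are in:

    [Solved] Error when using zlib.decompress in Python: error: Error -3 while decompressing data: incorrect header check

    After that, the returned HTML could be decompressed.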
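    The error above can be reproduced without any network access. The following is a minimal sketch, assuming gzip data built in memory with gzip.GzipFile; the helper name gzip_bytes and the sample HTML are made up for illustration. With the default wbits, zlib.decompress expects a zlib header and rejects the gzip header; adding 16 to wbits tells zlib to expect a gzip header instead.

```python
import gzip
import io
import zlib


def gzip_bytes(data):
    """Hypothetical helper: return data compressed in gzip format, in memory."""
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb") as f:
        f.write(data)
    return buf.getvalue()


original = b"<html><body>hello gzip</body></html>" * 10
compressed = gzip_bytes(original)

# Default wbits (15) expects a zlib header, so gzip data is rejected.
default_error = None
try:
    zlib.decompress(compressed)
except zlib.error as exc:
    default_error = str(exc)

# 16 + MAX_WBITS: expect a gzip header and trailer.
restored = zlib.decompress(compressed, 16 + zlib.MAX_WBITS)

# 32 + MAX_WBITS: let zlib auto-detect a zlib or gzip header.
restored_auto = zlib.decompress(compressed, 32 + zlib.MAX_WBITS)
```

    Here default_error holds the "incorrect header check" message, while both restored and restored_auto equal the original bytes.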

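    3. I referred to this post:

    http://flyash.itcao.com/post_1117.html

    From it I learned that you can check whether the returned HTTP response contains Content-Encoding: gzip, and only then decide whether to call zlib to decompress the body.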
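    The check described in this step can be sketched as a small function that works on the body and the header value alone. This is an illustration under assumptions, not the code from the post: the name decode_body is hypothetical, and the deflate branch goes beyond what the post implements. It is included because a request that advertises 'gzip, deflate' may get a deflate response, which some servers send as raw deflate without the zlib header.

```python
import zlib


def decode_body(body, content_encoding):
    """Undo the Content-Encoding reported by the server, if there is one."""
    encoding = (content_encoding or "").strip().lower()
    if encoding == "gzip":
        return zlib.decompress(body, 16 + zlib.MAX_WBITS)
    if encoding == "deflate":
        try:
            # "deflate" is normally zlib-wrapped data
            return zlib.decompress(body)
        except zlib.error:
            # assumption: fall back to raw deflate for servers that omit the header
            return zlib.decompress(body, -zlib.MAX_WBITS)
    # no (or unknown) Content-Encoding: return the body untouched
    return body


html = b"<html><body>sample page</body></html>" * 5

# Build the three compressed variants in memory for the demonstration.
gz = zlib.compressobj(9, zlib.DEFLATED, 16 + zlib.MAX_WBITS)
gzip_body = gz.compress(html) + gz.flush()
zlib_body = zlib.compress(html)
raw = zlib.compressobj(9, zlib.DEFLATED, -zlib.MAX_WBITS)
raw_body = raw.compress(html) + raw.flush()

from_gzip = decode_body(gzip_body, "gzip")
from_zlib = decode_body(zlib_body, "deflate")
from_raw = decode_body(raw_body, "deflate")
from_plain = decode_body(html, None)
```

    All four results are the original html bytes; a response without a Content-Encoding header passes through unchanged.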

    4. Finally, I implemented the complete code, as follows:

    import urllib
    import urllib2
    import zlib

    #------------------------------------------------------------------------------
    # get response from url
    # note: if you have already used cookiejar, then here will automatically use it
    # while using urllib2.Request
    # gConst is assumed to be a global dict defined elsewhere in the script
    def getUrlResponse(url, postDict={}, headerDict={}, timeout=0, useGzip=False):
        # make sure url is a str, not unicode, otherwise urllib2.urlopen will raise an error
        url = str(url)

        if postDict:
            # passing data to urllib2.Request turns the request into a POST
            postData = urllib.urlencode(postDict)
            req = urllib2.Request(url, postData)
            req.add_header('Content-Type', 'application/x-www-form-urlencoded')
        else:
            req = urllib2.Request(url)

        defHeaderDict = {
            'User-Agent'    : gConst['userAgentIE9'],
            'Cache-Control' : 'no-cache',
            'Accept'        : '*/*',
            'Connection'    : 'Keep-Alive',
        }

        # add default headers first
        for eachDefHd in defHeaderDict.keys():
            #print "add default header: %s=%s"%(eachDefHd,defHeaderDict[eachDefHd])
            req.add_header(eachDefHd, defHeaderDict[eachDefHd])

        if useGzip:
            #print "use gzip for",url
            # note: 'deflate' is advertised here, but only gzip responses
            # are decompressed in getUrlRespHtml below
            req.add_header('Accept-Encoding', 'gzip, deflate')

        # add customized headers last -> allows overwriting the default headers
        if headerDict:
            #print "added header:",headerDict
            for key in headerDict.keys():
                req.add_header(key, headerDict[key])

        if timeout > 0:
            # set timeout value if necessary
            resp = urllib2.urlopen(req, timeout=timeout)
        else:
            resp = urllib2.urlopen(req)

        return resp

    #------------------------------------------------------------------------------
    # get response html==body from url
    #def getUrlRespHtml(url, postDict={}, headerDict={}, timeout=0, useGzip=False):
    def getUrlRespHtml(url, postDict={}, headerDict={}, timeout=0, useGzip=True):
        resp = getUrlResponse(url, postDict, headerDict, timeout, useGzip)
        respHtml = resp.read()
        if useGzip:
            #print "---before unzip, len(respHtml)=",len(respHtml)
            respInfo = resp.info()

            # Server: nginx/1.0.8
            # Date: Sun, 08 Apr 2012 12:30:35 GMT
            # Content-Type: text/html
            # Transfer-Encoding: chunked
            # Connection: close
            # Vary: Accept-Encoding
            # ...
            # Content-Encoding: gzip

            # sometimes the request accepts gzip,deflate but the returned html is not gzipped
            # -> the response info does not include the above "Content-Encoding: gzip"
            # -> so only decompress when the data really is gzipped
            if ("Content-Encoding" in respInfo) and (respInfo['Content-Encoding'] == "gzip"):
                # 16+zlib.MAX_WBITS tells zlib to expect a gzip header
                respHtml = zlib.decompress(respHtml, 16+zlib.MAX_WBITS)
                #print "+++ after unzip, len(respHtml)=",len(respHtml)

        return respHtml

     

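    【Summary】

    The main logic for adding gzip support to urllib2.urlopen in Python is:

    1. Add the corresponding gzip header to the request:

    req.add_header('Accept-Encoding', 'gzip, deflate')

    2. After getting the returned HTML, decompress it with zlib:

    respHtml = zlib.decompress(respHtml, 16+zlib.MAX_WBITS)

    Before decompressing, first check whether the returned content really is gzip-compressed data, i.e. whether the response contains "Content-Encoding: gzip". Even when your HTTP request says it accepts gzip, the server may still return the original HTML without gzip compression.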
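    urllib2 exists only in Python 2. As a rough sketch of how the same two steps might look in Python 3, where urllib2.urlopen became urllib.request.urlopen, consider the following. The names read_body, fetch_html and FakeResponse are hypothetical, and the sketch advertises only gzip because that is the only encoding it decodes. FakeResponse is a stand-in so that the example can be exercised without a network connection.

```python
import gzip
import urllib.request
import zlib


def read_body(resp):
    """Read a response body, gunzipping it only if the server says it is gzip."""
    body = resp.read()
    encoding = resp.headers.get("Content-Encoding", "")
    if encoding.strip().lower() == "gzip":
        body = zlib.decompress(body, 16 + zlib.MAX_WBITS)
    return body


def fetch_html(url, timeout=10):
    """Request url, advertising gzip support, and return the decoded body."""
    req = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return read_body(resp)


class FakeResponse(object):
    """Hypothetical stand-in for a urlopen() response, for offline checks."""

    def __init__(self, body, headers):
        self._body = body
        self.headers = headers  # a plain dict is enough for .get()

    def read(self):
        return self._body


page = b"<html><body>hello</body></html>"
gzipped = FakeResponse(gzip.compress(page), {"Content-Encoding": "gzip"})
plain = FakeResponse(page, {})

body_from_gzip = read_body(gzipped)   # decompressed
body_from_plain = read_body(plain)    # returned as-is
```

    With a real URL, fetch_html(url) would perform the request and the header check in one call.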

  • Original article: https://www.cnblogs.com/thewindkee/p/12873222.html