
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b (web scraping)

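The full error message is:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

When scraping with the pycurl package, the response body is returned as bytes and has to be decoded before it can be written to a file or printed. The code is as follows: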

    import gzip
    import pycurl
    import re
    try:
        from io import BytesIO
    except ImportError:
        from StringIO import StringIO as BytesIO

    headersOfSend = [
        # .....
        "Accept-Encoding: gzip, deflate",
        # ......
        "Upgrade-Insecure-Requests: 1",
        "User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36",
    ]

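    headers = {}

    def header_function(header_line):
        # Minimal stand-in: the original post does not show this callback.
        # It stores each response header under its lower-cased name.
        header_line = header_line.decode('iso-8859-1')
        if ':' in header_line:
            name, value = header_line.split(':', 1)
            headers[name.strip().lower()] = value.strip()

    buffer = BytesIO()
    c = pycurl.Curl()
    c.setopt(c.URL, 'http://xxxxxxxx.com')
    c.setopt(c.WRITEFUNCTION, buffer.write)
    # Send our request headers.
    c.setopt(pycurl.HTTPHEADER, headersOfSend)
    # Set our header function.
    c.setopt(c.HEADERFUNCTION, header_function)
    c.perform()
    c.close()
    body = buffer.getvalue()
    print(type(body))
    print(str(body, encoding='utf-8'))  # this is the line that raises the error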

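Cause

The request headers include Accept-Encoding, which tells the server that the client can accept compressed data, so the server compresses the response before sending it back. A browser decompresses the data locally once it has been received. This script does not, so body still holds gzip-compressed bytes. A gzip stream begins with the bytes 0x1f 0x8b, which is why decoding it as UTF-8 fails on byte 0x8b at position 1.
In short, the error occurs because the program never decompresses the response. Removing the Accept-Encoding line, or decompressing the body before decoding it, avoids the problem.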
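The error can be reproduced without any network access. The following is a minimal sketch that uses only the standard library; the sample HTML string is made up for illustration and stands in for a real response body.

```python
import gzip

# Hypothetical sample page; any text would do.
html = "<html><body>hello</body></html>"
compressed = gzip.compress(html.encode("utf-8"))

# Every gzip stream starts with the two magic bytes 0x1f 0x8b.
print(compressed[:2].hex())

try:
    compressed.decode("utf-8")
except UnicodeDecodeError as exc:
    # 0x1f is a valid ASCII control character, so decoding only fails
    # at index 1, on the byte 0x8b.
    print(exc)

# Decompressing first and decoding afterwards works.
print(gzip.decompress(compressed).decode("utf-8"))
```

The position reported in the exception matches the second magic byte, which is a quick way to recognise that a body is still gzip-compressed.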

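Solutions

Method 1: remove the header line

    # "Accept-Encoding: gzip, deflate",

Method 2: decompress the body

Keep the header field and use the gzip module to decompress the body before decoding it.

Version 1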

    body = buffer.getvalue()
    # gzip.decompress is available in Python 3.2 and later.
    res = gzip.decompress(body).decode("utf-8")
    print(res)
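This version assumes the server always answers with gzip. Because the request also advertises deflate, and a server is free to ignore Accept-Encoding and send uncompressed data, a more defensive variant may be useful. The sketch below is one possible approach, not part of the original post: the helper name `decompress_body` and its `content_encoding` parameter (the value of the Content-Encoding response header, if it was captured) are assumptions made for illustration.

```python
import gzip
import zlib


def decompress_body(body, content_encoding=""):
    """Return the decompressed bytes of an HTTP response body.

    `content_encoding` is the Content-Encoding response header value, if
    known. gzip data is also recognised by its magic bytes.
    """
    content_encoding = content_encoding.strip().lower()
    if content_encoding == "gzip" or body[:2] == b"\x1f\x8b":
        return gzip.decompress(body)
    if content_encoding == "deflate":
        try:
            # Standard form: deflate data inside a zlib wrapper.
            return zlib.decompress(body)
        except zlib.error:
            # Some servers send raw deflate data without the wrapper.
            return zlib.decompress(body, -zlib.MAX_WBITS)
    # Not compressed (or an encoding this sketch does not handle).
    return body


# Simulated response body; a real one would come from buffer.getvalue().
sample = gzip.compress("<p>你好</p>".encode("utf-8"))
print(decompress_body(sample).decode("utf-8"))
```

With a helper like this, the decode step stays the same whether or not the server actually compressed the response.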

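Version 2

    body = buffer.getvalue()
    # Wrap the bytes in a file-like object so GzipFile can read them.
    buff = BytesIO(body)
    f = gzip.GzipFile(fileobj=buff)
    # Assumed here: the original post does not show how `encoding` was determined.
    encoding = 'utf-8'
    # Decode using the encoding we figured out.
    htmls = f.read().decode(encoding)
    print(type(htmls))
    print(dir(htmls))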

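Version 3

    # Rewind the buffer that pycurl wrote into and read it directly.
    buffer.seek(0, 0)
    f = gzip.GzipFile(fileobj=buffer)
    # Assumed here: the original post does not show how `encoding` was determined.
    encoding = 'utf-8'
    # Decode using the encoding we figured out.
    htmls = f.read().decode(encoding)
    print(htmls)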
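Versions 2 and 3 decode with an encoding that was "figured out" elsewhere, and the post does not show that step. One common way to do it is to read the charset parameter of the Content-Type response header. The sketch below assumes the response headers were collected into a dict by the HEADERFUNCTION callback. The function name, the `headers` dict and the utf-8 fallback are all assumptions, not the author's code.

```python
import re


def charset_from_content_type(content_type, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value."""
    match = re.search(r'charset=\s*"?([\w.:-]+)', content_type, re.IGNORECASE)
    return match.group(1).lower() if match else default


# Hypothetical header values, as a HEADERFUNCTION callback might collect them.
headers = {"content-type": "text/html; charset=GBK"}
encoding = charset_from_content_type(headers.get("content-type", ""))
print(encoding)
```

If the header carries no charset, the fallback is only a guess, and pages that declare their encoding in a meta tag would need additional handling.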


Original article: https://www.cnblogs.com/meiguhuaxian/p/14167009.html