Python Web Crawler
Sometimes crawler code that was written correctly and had been running fine suddenly starts throwing an error.

The error message looks like this:

Http 800 Internal internet error
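To see exactly which status code and reason the server sent back, the failing call can be wrapped so the error is caught instead of crashing the script. The following is a minimal sketch, not the original author's code: `try_fetch` is a hypothetical helper name, and its `opener` parameter is an assumption added only so the function can be exercised without network access.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def try_fetch(url, opener=urlopen):
    """Return (status, body). On failure, body is None.

    try_fetch and its opener parameter are hypothetical names used
    for illustration; they are not part of the original crawler.
    """
    try:
        with opener(url) as response:
            return response.status, response.read()
    except HTTPError as err:
        # The server answered, but with an error status (for example 403).
        print(f"HTTP error {err.code}: {err.reason}")
        return err.code, None
    except URLError as err:
        # No HTTP response at all (DNS failure, refused connection, ...).
        print(f"Could not reach server: {err.reason}")
        return None, None
```

Catching `HTTPError` before `URLError` matters here, because `HTTPError` is a subclass of `URLError`.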

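This happens because the target website has anti-crawler measures in place, so requests sent by the existing crawler code are rejected.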
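One likely reason the site can recognize the crawler is the User-Agent header that urllib sends by default. The short sketch below only inspects that default locally and makes no network request; the variable names are chosen for illustration.

```python
import urllib.request

# build_opener() returns an OpenerDirector whose addheaders list holds
# the headers urllib attaches to every request by default.
opener = urllib.request.build_opener()
default_headers = dict(opener.addheaders)

# Prints a value of the form "Python-urllib/3.x", which a server can
# easily tell apart from a real browser.
print(default_headers["User-agent"])
```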

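The crawler code that previously worked is as follows:

    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    ...
    # urlopen() sends the request with urllib's default headers
    html = urlopen(scrapeUrl)
    bsObj = BeautifulSoup(html.read(), "html.parser")
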
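In this situation the crawler needs a disguise: add a request header so that the request appears to come from a browser.

The modified code is as follows: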

    import urllib.parse
    import urllib.request
    from bs4 import BeautifulSoup
    ...
    # Build the request explicitly so a header can be attached to it
    req = urllib.request.Request(scrapeUrl)
    # Present the request as coming from a browser
    req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')
    response = urllib.request.urlopen(req)
    html = response.read()
    bsObj = BeautifulSoup(html, "html.parser")

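When several pages need the same disguise, the header setup can be pulled into a small helper. This is a sketch under assumptions rather than the original author's code: `make_browser_request` and `BROWSER_USER_AGENT` are hypothetical names, and the User-Agent string is only an example value. It also shows that headers can be passed to the `Request` constructor instead of calling `add_header()`.

```python
import urllib.request

# Example User-Agent string; an assumption for illustration. Any string
# copied from a current browser can be substituted.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/120.0.0.0 Safari/537.36"
)


def make_browser_request(url, user_agent=BROWSER_USER_AGENT):
    """Build a Request carrying a browser-like User-Agent header."""
    # Headers can be passed to the constructor instead of add_header().
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


# Building the Request does not touch the network.
req = make_browser_request("https://example.com/")
# urllib stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))
```

The returned object is passed to `urllib.request.urlopen(req)` in the same way as `req` in the modified code.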
OK, that's all it takes, and the crawler can carry on.
Original article: https://www.cnblogs.com/davidgu/p/5572547.html
