urlopen raises HTTP Error 403: Forbidden when opening Jianshu's robots.txt

Code that triggers the error:

from urllib.robotparser import RobotFileParser
from urllib.request import urlopen

rp = RobotFileParser()
# No browser-like User-Agent is sent here, so the server answers with 403
rp.parse(urlopen('https://www.jianshu.com/robots.txt').read().decode('utf-8').split('\n'))
print(rp.can_fetch('*', 'https://www.jianshu.com/p/e9eb86a6d120'))
print(rp.can_fetch('*', 'https://www.jianshu.com/u/080bb4eac1c9?utm_source=desktop&utm_medium=index-users'))

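Cause: when a URL is opened with urllib.request.urlopen, the server receives only a bare request for the page. By default urllib identifies itself as Python-urllib/3.x, so the server learns nothing about a browser or operating system behind the request. Requests that lack this information usually do not come from normal visitors, and some sites reject them.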
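To see what the server actually receives, the default headers can be inspected without sending any request. This is a minimal sketch: the names `opener`, `default_headers`, and `req` are chosen here for illustration and are not from the original post.

```python
from urllib.request import Request, build_opener

# build_opener() creates the same kind of opener that urlopen uses internally.
opener = build_opener()

# addheaders is a list of (name, value) pairs sent with every request.
default_headers = dict(opener.addheaders)
print(default_headers)

# A Request built without a headers argument carries no headers of its own,
# so only the opener's default User-Agent would be sent.
req = Request('https://www.jianshu.com/robots.txt')
print(req.headers)
```

The value printed for the User-agent key starts with Python-urllib/, which is the kind of identifier a site can use to filter out scripted traffic.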

Solution: add a User-Agent to the request headers

from urllib.robotparser import RobotFileParser
from urllib.request import urlopen, Request

rp = RobotFileParser()
headers = {
    'User-Agent': 'Mozilla/4.0(compatible; MSIE 5.5; Windows NT)'
}
req = Request('https://www.jianshu.com/robots.txt', headers=headers)
rp.parse(urlopen(req).read().decode('utf-8').split('\n'))
print(rp.can_fetch('*', 'https://www.jianshu.com/p/e9eb86a6d120'))
print(rp.can_fetch('*', 'https://www.jianshu.com/u/
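The same fix can be wrapped into small helpers so that the header handling and the parsing can be checked separately. The following is only a sketch under stated assumptions: the function names `build_request`, `parser_from_text`, and `fetch_robots`, the `timeout` default, and the sample rules are all hypothetical and are not part of the original post or of Jianshu's real robots.txt.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen
from urllib.robotparser import RobotFileParser


def build_request(url, user_agent):
    """Return a Request that carries an explicit User-Agent header."""
    return Request(url, headers={'User-Agent': user_agent})


def parser_from_text(text):
    """Build a RobotFileParser from the body of a robots.txt file."""
    rp = RobotFileParser()
    rp.parse(text.splitlines())
    return rp


def fetch_robots(url, user_agent, timeout=10):
    """Download robots.txt with the given User-Agent; return None on an HTTP error."""
    try:
        with urlopen(build_request(url, user_agent), timeout=timeout) as resp:
            return parser_from_text(resp.read().decode('utf-8'))
    except HTTPError as err:
        # A 403 here means the server still refused the request.
        print(f'{url} returned HTTP {err.code}')
        return None


# Offline demonstration with made-up rules (an assumption, not Jianshu's file).
sample = "User-agent: *\nDisallow: /search\nAllow: /p/\n"
rp = parser_from_text(sample)
print(rp.can_fetch('*', 'https://example.com/p/abc'))
print(rp.can_fetch('*', 'https://example.com/search?q=x'))
```

Calling `fetch_robots('https://www.jianshu.com/robots.txt', user_agent)` with the User-Agent string from the code above would perform the real download. Whether the server accepts a given string is up to the site, so a None result should be handled by the caller.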

Reference article

HTTP Error 403: Forbidden when crawling Jianshu's robots.txt

Original article: https://www.cnblogs.com/my_captain/p/11032068.html