Python crawlers: basic usage of the urllib library

urllib is the request library that ships with Python, and its feature set is fairly complete. It consists of the following four modules:

urllib.request: the request module

urllib.error: the exception-handling module

urllib.parse: the URL-parsing module

urllib.robotparser: the robots.txt-parsing module
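
urllib.robotparser is not demonstrated elsewhere in this post, so here is a minimal sketch of how it might be used before crawling a site. The robots.txt rules, the crawler name MyCrawler, and the example.com URLs are made up for illustration; a real crawler would normally call set_url() and read() to download the site's actual robots.txt instead of parsing an inline string.

```python
import urllib.robotparser

# Hypothetical robots.txt content, supplied inline so the sketch runs offline
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse() takes the file as a list of lines

# can_fetch(useragent, url) reports whether the rules allow that agent to fetch the URL
allowed_index = parser.can_fetch('MyCrawler', 'http://example.com/index.html')
allowed_private = parser.can_fetch('MyCrawler', 'http://example.com/private/data.html')
print(allowed_index, allowed_private)
```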

Below are some ways to use the urllib library.

Using urllib.request

import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')
print(response.read().decode('utf-8'))

read() returns the page's HTML as a byte stream, so it has to be decoded before it can be printed as text.
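
Hard-coding 'utf-8' works for many pages, but not for ones served in another encoding such as GBK. One possible approach, sketched below, is to take the charset from the response's Content-Type header and fall back to UTF-8 when none is declared. The helper name read_text is an assumption for illustration, and the data: URL merely stands in for a real page so that the sketch runs without a network connection.

```python
import urllib.request


def read_text(response, default='utf-8'):
    """Read the body and decode it using the charset declared in Content-Type."""
    # response.headers is an email.message.Message; get_content_charset()
    # returns None when the header names no charset
    charset = response.headers.get_content_charset() or default
    return response.read().decode(charset, errors='replace')


# Stand-in for a real page: the bytes of '你好' encoded as GBK, labelled charset=gbk
sample_url = 'data:text/plain;charset=gbk,%C4%E3%BA%C3'
with urllib.request.urlopen(sample_url) as response:
    text = read_text(response)
print(text)
```

With a real site, the same helper would be called on the object returned by urllib.request.urlopen('http://www.baidu.com').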

import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')
print(response.status)  # the status code; response.getcode() returns the same value
print(response.getheaders())  # all response headers: server type, date, content type, connection state, and so on
print(response.getheader('Server'))  # a single header; the argument names the header you want
print(response.geturl())  # the URL of the response
print(response.read())  # the response body as bytes; decode it with the page's charset to get readable text
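
The calls above assume the request succeeds. The urllib.error module listed earlier defines the exceptions raised when it does not: HTTPError for responses with an error status, and its parent class URLError for failures where no response arrives at all. A minimal sketch follows; the function name fetch and its return convention are assumptions for illustration.

```python
import urllib.error
import urllib.request


def fetch(url, timeout=10):
    """Return (status, body bytes), or (None, reason text) when no response arrives."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404 or 500.
        # HTTPError must be caught first because it is a subclass of URLError.
        return err.code, err.read()
    except urllib.error.URLError as err:
        # No HTTP response at all: DNS failure, refused connection, unsupported scheme
        return None, str(err.reason)


# An unsupported scheme fails before any network access, which keeps this demo offline
status, detail = fetch('foo://example.com/')
print(status, detail)
```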

Adding request headers so that the request looks like it comes from a browser

1. Pass the headers when building the request

import urllib.request

# httpbin's /post endpoint only accepts POST, so this GET example uses /get
url = 'http://httpbin.org/get'
header = {}
# Parentheses let the adjacent string literals continue across lines
header['User-Agent'] = ('Mozilla/5.0 '
                        '(Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 '
                        '(KHTML, like Gecko) Version/5.1 Safari/534.50')

req = urllib.request.Request(url=url, headers=header)
res = urllib.request.urlopen(req)

Request is the request class; the headers are passed to its constructor as a parameter and become part of the request.

2. Append headers dynamically

To append headers dynamically, you must first instantiate the Request class.

import urllib.request
import urllib.parse

# httpbin's /post endpoint only accepts POST, so this GET example uses /get
url = 'http://httpbin.org/get'

req = urllib.request.Request(url=url)
req.add_header('User-Agent','Mozilla/5.0 (Windows NT 6.1; WOW64; rv:53.0) Gecko/20100101 Firefox/53.0')
res = urllib.request.urlopen(req)
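
Both approaches produce an equivalent Request object, and it can be inspected before anything is sent. The sketch below builds one request each way and compares them; no network access is needed. One detail worth knowing is that Request normalizes header names with str.capitalize(), so the header is stored as 'User-agent'. The shortened User-Agent value is a placeholder.

```python
import urllib.request

url = 'http://httpbin.org/get'
user_agent = 'Mozilla/5.0 (placeholder)'  # placeholder value for illustration

# Approach 1: headers passed to the constructor
req1 = urllib.request.Request(url=url, headers={'User-Agent': user_agent})

# Approach 2: header appended after construction
req2 = urllib.request.Request(url=url)
req2.add_header('User-Agent', user_agent)

# Header names are stored capitalized, so look them up as 'User-agent'
print(req1.get_header('User-agent'))
print(req1.header_items() == req2.header_items())
print(req1.get_method())  # GET, because no data was supplied
```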

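Using a proxy:

ProxyHandler is a class in urllib.request that lets you send requests through a proxy.

It takes a dict whose keys are the protocol (http, https) and whose values are the proxy's IP address and port.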

import urllib.request

proxy_handler = urllib.request.ProxyHandler({
    'http':'112.35.29.53:8088',
    'https':'165.227.169.12:80'
})
opener = urllib.request.build_opener(proxy_handler)
response = opener.open('http://www.baidu.com')
print(response.read())
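
opener.open() applies the proxy only to requests made through that opener. If every later urllib.request.urlopen() call should use it as well, the opener can be installed globally with install_opener(). A sketch under those assumptions, where make_opener is a hypothetical helper and the proxy addresses are the sample values from above, which may no longer be reachable:

```python
import urllib.request


def make_opener(proxies, user_agent='Mozilla/5.0'):
    """Build an opener that routes requests through the given proxies."""
    # Passing an empty dict to ProxyHandler disables proxies entirely
    opener = urllib.request.build_opener(urllib.request.ProxyHandler(proxies))
    # addheaders holds default headers sent with every request from this opener
    opener.addheaders = [('User-Agent', user_agent)]
    return opener


opener = make_opener({
    'http': '112.35.29.53:8088',
    'https': '165.227.169.12:80',
})
# After this call, plain urllib.request.urlopen() goes through the proxies too
urllib.request.install_opener(opener)
```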

Using urllib.parse

import urllib.request
import urllib.parse

url = 'http://httpbin.org/post'
header = {}
# Parentheses let the adjacent string literals continue across lines
header['User-Agent'] = ('Mozilla/5.0 '
                        '(Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 '
                        '(KHTML, like Gecko) Version/5.1 Safari/534.50')

data = {}
data['name'] = 'us'
# urlencode() builds the form string 'name=us'; encode() converts it to the bytes that Request expects
data = urllib.parse.urlencode(data).encode('utf-8')
req = urllib.request.Request(url=url, data=data, headers=header, method='POST')
response = urllib.request.urlopen(req)
print(response.read().decode('utf-8'))
print(type(data))  # <class 'bytes'>
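
urlencode() is only one of the helpers in urllib.parse. The sketch below shows a few others that tend to come up when crawling: splitting a URL into parts, decoding a query string, escaping a single value, and resolving a relative link. The URLs and parameter values are examples chosen for illustration.

```python
import urllib.parse

# Split a URL into its components
parts = urllib.parse.urlparse('http://httpbin.org/get?name=us&page=2#top')
print(parts.scheme, parts.netloc, parts.path, parts.query, parts.fragment)

# urlencode() and parse_qs() are inverses, apart from parse_qs wrapping values in lists
params = {'name': 'us', 'keyword': '爬虫'}
query = urllib.parse.urlencode(params)  # non-ASCII text is percent-encoded as UTF-8
decoded = urllib.parse.parse_qs(query)
print(query)
print(decoded)

# quote() escapes one value; unquote() reverses it
escaped = urllib.parse.quote('a b/c', safe='')
print(escaped, urllib.parse.unquote(escaped))

# Resolve a link found on a page against the page's own URL
absolute = urllib.parse.urljoin('http://httpbin.org/get', '/post')
print(absolute)
```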

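urllib has a lot of pitfalls, and I suggest simply giving it up: code I wrote with urllib last month now runs into all kinds of problems.

So use the requests library instead; its syntax is far more concise.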

Original article: https://www.cnblogs.com/mzc1997/p/7813786.html
