
Luffy Xuecheng (路飞学城) Web Scraping Study Notes (Part 1)

Web Scraping Study Notes (Part 1)

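I. Required module: requests

Introduction to requests

Requests is an HTTP library written in Python, built on top of urllib3 and released under the Apache 2.0 license. It is more convenient to use than the standard library's urllib and saves a great deal of work.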

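Installing requests

In a terminal, run pip install requests (preferably from the Python installation directory, so that the package is installed for the interpreter you intend to use).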
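A quick way to confirm that the installation worked is to import the package and print its version. This is a minimal check that is not part of the original notes; the version printed depends on your environment.

```python
import requests

# If this import succeeds, requests is installed for the current interpreter.
version = requests.__version__
print('requests version:', version)
```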

Common requests methods

# Send a GET request
requests.get(url='xxx')    # equivalent to requests.request(method='get', url='xxx')
# Send a POST request
requests.post(url='xxx')   # equivalent to requests.request(method='post', url='xxx')

Example: GET request

import requests

requests.get(
    url='http://www.oldboyedu.com',
    params={'nid': 1, 'name': 'x'},  # the URL actually sent is http://www.oldboyedu.com/?nid=1&name=x
    headers={},
    cookies={},
)
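To see how params ends up in the URL without touching the network, the request can be built and prepared but not sent. The sketch below uses requests.Request and its prepare() method; the URL and parameter values are the ones from the example above.

```python
import requests

# Build the request object without sending it, so no network access is needed.
req = requests.Request(
    method='GET',
    url='http://www.oldboyedu.com',
    params={'nid': 1, 'name': 'x'},
)
prepared = req.prepare()

# The params dict has been encoded into the query string of the final URL.
query_string = prepared.url.split('?', 1)[1]
print(prepared.url)
print(query_string)
```

The same prepared object also exposes the headers and body, which is handy for checking a request before it is actually sent.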

Example: POST request

import requests

requests.post(
    url='x',
    data={},
    headers={},
    cookies={},
)

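get and post accept the following arguments:

1 url: the address to request;

2 params: parameters passed in the URL query string;

3 headers: request headers;

4 cookies: cookies;

5 data: data sent in the request body.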

The data of a POST request can be sent in two forms:

(1)

requests.post(
    url='http://www.oldboyedu.com',
    data={
        'name': 'hahaha',
        'age': 18
    },
    headers={},
    cookies={},
)

Here the body is sent form-encoded, as name=hahaha&age=18.

(2)

import json

requests.post(
    url='http://www.oldboyedu.com',
    data=json.dumps({
        'name': 'hahaha',
        'age': 19
    }),
    headers={},
    cookies={},
)

Here the body is sent as a JSON string: '{"name": "hahaha", "age": 19}'
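The difference between the two forms is easiest to see by preparing the requests and inspecting the body and Content-Type header. The sketch below does this offline; it also prepares a third request using the json= argument for comparison. The variable names are illustrative only.

```python
import json
import requests

payload = {'name': 'hahaha', 'age': 18}
url = 'http://www.oldboyedu.com'

# Form 1: a dict passed to data= is form-encoded.
form_req = requests.Request('POST', url, data=payload).prepare()

# Form 2: a string passed to data= is sent as-is, with no Content-Type set by requests.
str_req = requests.Request('POST', url, data=json.dumps(payload)).prepare()

# For comparison: json= serializes the dict and sets Content-Type: application/json.
json_req = requests.Request('POST', url, json=payload).prepare()

for label, r in (('form', form_req), ('string', str_req), ('json', json_req)):
    print(label, r.body, r.headers.get('Content-Type'))
```

When a server expects JSON, passing json= is usually the safer choice, because with data=json.dumps(...) the Content-Type header has to be set by hand.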

Other request methods in requests

requests.get(url, params=None, **kwargs)
requests.post(url, data=None, json=None, **kwargs)
requests.put(url, data=None, **kwargs)
requests.head(url, **kwargs)
requests.delete(url, **kwargs)
requests.patch(url, data=None, **kwargs)
requests.options(url, **kwargs)

# All of the methods above are built on requests.request():
requests.request()
        - method: HTTP method: post, get, delete, put, head, patch, options
        - url: the address to request
        - params: parameters passed in the URL query string (GET), e.g. params = {k1: v1}
        - data: parameters passed in the request body, used for POST; data = {k1: v1, k2: v2} or 'k1=v1&k2=v2'
        - json: parameters passed in the request body as JSON; also sets Content-Type: application/json in the request headers
        - headers: data to add to the request headers
        - cookies: cookies for the site, sent in the request headers
        - files: file objects, e.g. {'f1': open('s1.py', 'rb'), 'f2': ('file name to use on the server', open('s1.py', 'rb'))}
        - auth: authentication; adds the user name and password to the request headers
        - timeout: timeout
        - allow_redirects: whether to follow redirects (bool)
        - proxies: proxies
        - stream: bool; stream the response body, used for downloading files
            ret = requests.get('http://127.0.0.1:8888/test/', stream=True)
            for i in ret.iter_content():
                print(i)
        - cert: SSL certificate file to use for HTTPS
        - verify: verify=False skips HTTPS certificate verification
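Several of these parameters simply end up as request headers. The following sketch prepares a request offline and prints the resulting headers; the User-Agent string, the cookie, and the user name and password are made-up values for illustration.

```python
import requests

req = requests.Request(
    method='GET',
    url='http://127.0.0.1:8888/test/',
    headers={'User-Agent': 'demo-client'},  # hypothetical header value
    cookies={'k1': 'v1'},                   # hypothetical cookie
    auth=('user', 'pass'),                  # hypothetical credentials (HTTP Basic auth)
)
prepared = req.prepare()

# headers, cookies and auth have all been turned into request headers.
for name, value in prepared.headers.items():
    print(name, ':', value)
```

The Authorization header carries the base64 encoding of user:pass, which is an encoding and not encryption, so Basic auth should only be used over HTTPS.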

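Ways to read the response:

response.text                # the body as str
response.content             # the body as bytes
response.encoding            # the encoding used to decode response.text; can be assigned
response.apparent_encoding   # the encoding detected from the content of the response
response.cookies.get_dict()  # the cookies as a dict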
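The relationship between content, encoding and text can be sketched offline by building a Response object by hand. Assigning the private attribute _content is a testing trick used here only so the example runs without a network; in real code the response comes from requests.get(...).

```python
import requests

# Hand-built response, for illustration only (assumption: not how responses are normally created).
response = requests.Response()
response.status_code = 200
response._content = '你好'.encode('gbk')  # raw bytes as a GBK-encoded page would send them

# text is content decoded with the current value of encoding.
response.encoding = 'gbk'
text_ok = response.text

# With the wrong encoding the same bytes decode to garbage.
response.encoding = 'latin-1'
text_wrong = response.text

print(repr(response.content), repr(text_ok), repr(text_wrong))
```

A common pattern for pages that come back garbled is to set response.encoding = response.apparent_encoding before reading response.text, though detection on short content is not always reliable.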

Advanced usage of requests:

# requests.Session: manages cookies automatically
ret = requests.Session()
ret.get('https://www.baidu.com')
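What "manages cookies automatically" means can be sketched without a network: a cookie stored on the session is attached to every later request made through that session. Here the cookie is set by hand to stand in for one a server would have sent in an earlier response; the name token and the value abc are hypothetical.

```python
import requests

session = requests.Session()

# Stand-in for a cookie that a server would normally set via Set-Cookie.
session.cookies.set('token', 'abc')

# Prepare (but do not send) two requests through the same session.
first = session.prepare_request(requests.Request('GET', 'https://www.baidu.com'))
second = session.prepare_request(requests.Request('GET', 'https://www.baidu.com/s'))

print(first.headers.get('Cookie'))
print(second.headers.get('Cookie'))
```

With plain requests.get calls the cookie would have to be passed explicitly via cookies= on every call.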

II. The beautifulsoup4 module (used to parse the response)

Installing bs4:

pip3 install beautifulsoup4

A brief look at bs4 and how to use it

Importing:

from bs4 import BeautifulSoup

find_all: get all matching tags

from bs4 import BeautifulSoup

# ret is the response returned earlier by requests.get(...)
soup = BeautifulSoup(ret.text, 'html.parser')  # parse the HTML string returned by the server
tags = soup.find_all('a')   # returns a list of all <a> tags
print(tags)

tags = soup.find_all('a', limit=1)  # stop after the first match
print(tags)

tags = soup.find_all(name='a', attrs={'class': 'sister'}, recursive=True, text='Lacie')
# tags = soup.find(name='a', class_='sister', recursive=True, text='Lacie')
print(tags)


# ####### lists #######
v = soup.find_all(name=['a', 'div'])
print(v)

v = soup.find_all(class_=['sister0', 'sister'])
print(v)

v = soup.find_all(text=['Tillie'])
print(v, type(v[0]))


v = soup.find_all(id=['link1', 'link2'])  # match several id values at once
print(v)

v = soup.find_all(href=['link1', 'link2'])
print(v)

# ####### regular expressions #######
import re
rep = re.compile('p')    # tag names containing "p"
rep = re.compile('^p')   # tag names starting with "p" (this is the pattern used below)
v = soup.find_all(name=rep)
print(v)

rep = re.compile('sister.*')
v = soup.find_all(class_=rep)
print(v)

rep = re.compile('http://www.oldboy.com/static/.*')
v = soup.find_all(href=rep)
print(v)

# ####### filtering with a function #######
def func(tag):
    return tag.has_attr('class') and tag.has_attr('id')

# These two lines were indented after the return in the original and never ran.
v = soup.find_all(name=func)
print(v)


# ## get: read an attribute of a tag
tag = soup.find('a')
v = tag.get('id')
print(v)
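The listing above depends on a response fetched earlier. A self-contained sketch of the same filters, run against a small inline HTML snippet, might look like the following. The snippet is made up for illustration (it borrows the sister, link1 and Lacie names used above), and string= is used in place of text=, which newer Beautiful Soup releases prefer.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample document standing in for ret.text.
html = """
<html><body>
<p class="title" id="t1"><b>The Dormouse's story</b></p>
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
</body></html>
"""
soup = BeautifulSoup(html, 'html.parser')

all_links = soup.find_all('a')                                 # every <a> tag
first_only = soup.find_all('a', limit=1)                       # still a list, with one item
lacie = soup.find_all('a', class_='sister', string='Lacie')    # name + class + text
by_id = soup.find_all(id=['link1', 'link2'])                   # any of several ids
by_regex = soup.find_all(href=re.compile('lacie$'))            # attribute matched by regex


def has_class_and_id(tag):
    # A filter function receives each tag and returns True to keep it.
    return tag.has_attr('class') and tag.has_attr('id')


both = soup.find_all(has_class_and_id)

print(len(all_links), len(first_only), [t['id'] for t in by_id])
print([t.name for t in both])
```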

get_text: get the text inside a tag

from bs4 import BeautifulSoup

soup = BeautifulSoup(ret.text, 'html.parser')
tag = soup.find('a')
v = tag.get_text()  # the first argument of get_text is a separator, not an attribute name
print(v)
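get and get_text are easy to mix up: get reads an attribute of the tag, while get_text returns the text inside it, and the first positional argument of get_text is a separator string placed between text fragments. A small sketch on a made-up snippet:

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration.
html = '<div id="box"><a id="link1" href="/a">Elsie</a><a id="link2" href="/b">Lacie</a></div>'
soup = BeautifulSoup(html, 'html.parser')

tag = soup.find('a')
id_value = tag.get('id')          # attribute lookup
missing = tag.get('title')        # a missing attribute gives None instead of raising
text = tag.get_text()             # text inside the first <a>

div = soup.find('div')
plain = div.get_text()            # all nested text, concatenated
joined = div.get_text('|')        # same text, with '|' between the fragments

print(id_value, missing, text, plain, joined)
```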

find: get the first matching tag

tag = soup.find('a')
print(tag)
tag = soup.find(name='a', attrs={'class': 'sister'}, recursive=True, text='Lacie')
tag = soup.find(name='a', class_='sister', recursive=True, text='Lacie')  # same result as the line above
print(tag)
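One practical difference between find and find_all is what they return when nothing matches: find gives None, while find_all gives an empty list. A minimal sketch on a made-up snippet:

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration.
html = '<p><a id="link1">Elsie</a><a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, 'html.parser')

first = soup.find('a')             # the first <a> tag only
every = soup.find_all('a')         # all <a> tags, as a list

no_tag = soup.find('span')         # nothing matches: None
no_tags = soup.find_all('span')    # nothing matches: empty list

# Guard against None before using the result of find.
first_text = first.get_text() if first is not None else ''
print(first_text, len(every), no_tag, len(no_tags))
```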

For more methods, see

http://www.cnblogs.com/wupeiqi/articles/6283017.html

Official documentation for more parameters: http://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/

III. Examples

Scraping news from Autohome (汽车之家)

Upvoting a post on Chouti (抽屉)


Upvoting the posts on a specific Chouti page



Original article: https://www.cnblogs.com/hellowq/p/9293262.html