Web scraping basics: the requests and BeautifulSoup modules
http://www.cnblogs.com/wupeiqi/articles/6283017.html
Scraper performance and the Scrapy framework
http://www.cnblogs.com/wupeiqi/articles/6283017.html
Python Development [Part 15]: the Tornado web framework
http://www.cnblogs.com/wupeiqi/articles/5702910.html
A custom asynchronous, non-blocking web framework in 200 lines
http://www.cnblogs.com/wupeiqi/p/6536518.html
Modules
requests.get(url='URL path')
BeautifulSoup
soup = BeautifulSoup('HTML string', 'html.parser')
tag = soup.find(name='div',attrs = {'id':'t'})
tags = soup.find_all(name='div',attrs = {'id':'t'})
tag.find('h3').text
tag.find('h3').get('attribute name')  # e.g. get('href')
tag.find('h3').attrs
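Putting the two modules together, a minimal fetch-and-parse sketch might look like this (the URL and the div id 't' are placeholders for illustration, not taken from a real page):

import requests
from bs4 import BeautifulSoup

response = requests.get(url='http://www.example.com')   # fetch the page (placeholder URL)
soup = BeautifulSoup(response.text, 'html.parser')       # parse the HTML text
tag = soup.find(name='div', attrs={'id': 't'})           # first matching <div id="t">, or None
if tag:
    print(tag.text)    # text inside the tag
    print(tag.attrs)   # dict of the tag's attributes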
HTTP request basics
requests
GET:
requests.get(url="http://www.oldboyedu.com")
# Raw request on the wire (roughly):
# GET / HTTP/1.1
# Host: oldboyedu.com
# ...

requests.get(url="http://www.oldboyedu.com/index.html?p=1")
# GET /index.html?p=1 HTTP/1.1
# Host: oldboyedu.com
# ...

requests.get(url="http://www.oldboyedu.com/index.html", params={'p': 1})
# Same request as above: params is encoded into the query string
# GET /index.html?p=1 HTTP/1.1
# Host: oldboyedu.com
# ...
POST:
requests.post(url="http://www.oldboyedu.com", data={'name': 'alex', 'age': 18})
# Default Content-Type: application/x-www-form-urlencoded
# POST / HTTP/1.1
# Host: oldboyedu.com
# ...
#
# name=alex&age=18

requests.post(url="http://www.oldboyedu.com", json={'name': 'alex', 'age': 18})
# Default Content-Type: application/json
# POST / HTTP/1.1
# Host: oldboyedu.com
# ...
#
# {"name": "alex", "age": 18}

requests.post(
    url="http://www.oldboyedu.com",
    params={'p': 1},
    json={'name': 'alex', 'age': 18}
)
# Default Content-Type: application/json; params goes into the query string, json becomes the body
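To see the difference between data= and json= concretely, the prepared request that was actually sent can be inspected through ret.request; a small sketch (example.com is a placeholder host):

import requests

ret = requests.post("http://www.example.com", data={'name': 'alex', 'age': 18})
print(ret.request.headers.get('Content-Type'))  # application/x-www-form-urlencoded
print(ret.request.body)                         # name=alex&age=18

ret = requests.post("http://www.example.com", json={'name': 'alex', 'age': 18})
print(ret.request.headers.get('Content-Type'))  # application/json
print(ret.request.body)                         # JSON-encoded body: {"name": "alex", "age": 18}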
GET request
Example with parameters:
import requests

payload = {'key1': 'v1', 'key2': 'v2'}

ret = requests.get("http://test.cn/get", params=payload)

print(ret.url)
print(ret.text)
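Besides ret.url and ret.text, the Response object returned above exposes a few other commonly used attributes; an indicative sketch:

print(ret.status_code)  # numeric HTTP status code, e.g. 200
print(ret.encoding)     # encoding used when decoding ret.text
print(ret.content)      # raw response body as bytes (ret.text is the decoded str)
print(ret.cookies)      # cookies set by the server
# ret.json() parses the body as JSON and raises ValueError if it is not valid JSON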
POST request
import requests
import json

url = 'https://api.github.com/some/endpoint'
payload = {'some': 'data'}
headers = {'content-type': 'application/json'}

ret = requests.post(url, data=json.dumps(payload), headers=headers)
print(ret.text)
print(ret.cookies)
Other request parameters (a combined sketch follows the list):
1. method
2. url
3. params
4. data
5. json
6. headers
7. cookies
8. files
9. auth
10. timeout
11. allow_redirects
12. proxies
13. stream
14. cert
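A minimal sketch that exercises several of these parameters in one call (the URL, header values, proxy address, and timeout are placeholders chosen for illustration):

import requests

ret = requests.request(
    method='GET',
    url='http://www.example.com',
    params={'p': 1},                      # appended to the query string
    headers={'User-Agent': 'my-spider'},  # custom request headers
    cookies={'session': 'abc'},           # cookies sent with the request
    timeout=5,                            # seconds to wait before giving up
    allow_redirects=True,                 # follow 3xx redirects automatically
    # proxies={'http': 'http://127.0.0.1:8888'},  # route the request through a proxy
    # stream=True,                        # defer downloading the response body
)
print(ret.status_code)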
requests.request(
    method='POST',
    url='http://127.0.0.1:8000/test/',
    data=open('data_file.py', mode='r', encoding='utf-8'),  # file contents: k1=v1;k2=v2;k3=v3;k3=v4
    headers={'Content-Type': 'application/x-www-form-urlencoded'}
)
def param_cookies():
    # Send cookies to the server
    requests.request(
        method='POST',
        url='http://127.0.0.1:8000/test/',
        data={'k1': 'v1', 'k2': 'v2'},
        cookies={'cook1': 'value1'},
    )
def param_auth():
    from requests.auth import HTTPBasicAuth, HTTPDigestAuth

    ret = requests.get('https://api.github.com/user', auth=HTTPBasicAuth('wupeiqi', 'sdfasdfasdf'))
    print(ret.text)
The BeautifulSoup module
This module takes an HTML or XML string, parses it into a document tree, and provides methods for quickly locating specific elements, which makes finding a given element in HTML or XML straightforward.
Installation:
pip3 install beautifulsoup4
Usage example:
from bs4 import BeautifulSoup

html_doc = """
<html>
<head>
    <title>The Dormouse's story</title>
</head>
<body>
    <div id='i1'>
        <a>sdfs</a>
    </div>
    <p class='c2'>asdfa</p>
</body>
</html>
"""
Specific methods:
1. name, the tag's name
soup = BeautifulSoup(html_doc, 'html.parser')
tag = soup.find('a')
name = tag.name    # get
print(name)
tag.name = 'span'  # set
print(soup)
2. attrs, the tag's attributes
soup = BeautifulSoup(html_doc, 'html.parser')
tag = soup.find('a')
attrs = tag.attrs         # get
print(attrs)
tag.attrs['id'] = 'iiii'  # set
print(soup)
attrs = tag.attrs         # get again
print(attrs)
3. children, all direct child nodes
body = soup.find('body')
v = body.children
print(list(v))
4. descendants, all descendant nodes
body = soup.find('body')
v = body.descendants
print(list(v))
5. clear, remove everything inside the tag (the tag itself is kept)
tag = soup.find('body')
tag.clear()
print(soup)
6. decompose, recursively remove the tag and everything inside it
tag = soup.find('body')
tag.decompose()
print(soup)
7. extract, recursively remove the tag and everything inside it, and return the removed tag
tag = soup.find('body')
tag.extract()
print(soup)
8. decode, convert to a string (including the current tag); decode_contents (excluding the current tag). encode converts to bytes.
tag = soup.find('body')
v = tag.decode()   # tag object -> str
v1 = tag.encode()  # tag object -> bytes
print(v, v1)
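decode_contents / encode_contents work the same way but leave out the enclosing tag itself; a small sketch, reusing the soup built from html_doc above:

tag = soup.find('div')
print(tag.decode())           # string including the <div> tag itself
print(tag.decode_contents())  # string of only what is inside the <div>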
9. find, get the first matching tag; find_all, get all matching tags
soup = BeautifulSoup(html_doc, 'html.parser')
tag = soup.find('a')
tags = soup.find_all('body')
for i in tags:
    print(i)
v = soup.find_all(name=['a','div'])
print(v)
10. has_attr, check whether the tag has the given attribute
soup = BeautifulSoup(html_doc, 'html.parser')
tag = soup.find('a')
v = tag.has_attr('class')
v1 = tag.has_attr('id')
print(v, v1)
11. get_text, get the text content inside the tag
soup = BeautifulSoup(html_doc, 'html.parser')
tag = soup.find('a')
v = tag.get_text()
print(v)
Other methods and attributes (a combined sketch follows the select example below):
index, get the position of a tag within its parent tag
is_empty_element, whether the tag is an empty element (i.e. allowed to be empty) or a self-closing tag
select, select_one: CSS selectors
soup.select("body a")
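A combined sketch of index, is_empty_element, and the CSS selector methods, reusing the soup built from html_doc above (the comments describe the intent, not verified output):

div = soup.find('div')
a = soup.find('a')
print(div.index(a))             # position of the <a> within the <div>'s children
print(a.is_empty_element)       # False: <a> is not a void/self-closing element
print(soup.select("body a"))    # all <a> tags under <body>, via CSS selector
print(soup.select_one("p.c2"))  # first <p class="c2">, or None if there is none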
Tag content:
print(tag.string)
tag.string = 'new content'  # set
append: append a tag inside the current tag
insert: insert a tag at a given position inside the current tag
insert_after, insert_before: insert after or before the current tag
replace_with: replace the current tag with the given tag
wrap: wrap the current tag inside the given tag
unwrap: remove the current tag, keeping what it wrapped (a combined sketch of these modification methods follows below)
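A minimal sketch of these modification methods on a tiny throwaway document (the tags and text are made up for illustration):

from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><a>old</a><p>x</p></div>", 'html.parser')
div = soup.find('div')
a = soup.find('a')

b = soup.new_tag('b')              # build a new <b>bold</b> tag
b.string = 'bold'

div.append(soup.new_tag('i'))      # append a new <i> as the last child of the <div>
div.insert(0, soup.new_tag('u'))   # insert a new <u> at position 0 inside the <div>
a.insert_after(b)                  # place <b>bold</b> right after <a>old</a>
soup.find('p').replace_with(soup.new_tag('hr'))  # swap <p>x</p> for an <hr/>

a.wrap(soup.new_tag('span'))       # wrap the <a> inside a new <span>
print(soup)

soup.find('span').unwrap()         # drop the <span>, keeping the <a> it wrapped
print(soup)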