  • Web Scraping

    Scraping basics: the requests and BeautifulSoup modules
    http://www.cnblogs.com/wupeiqi/articles/6283017.html

    Scraper performance and the Scrapy framework
    http://www.cnblogs.com/wupeiqi/articles/6283017.html

    Python Development [Part 15]: The Tornado Web Framework
    http://www.cnblogs.com/wupeiqi/articles/5702910.html

    A custom asynchronous non-blocking web framework in 200 lines
    http://www.cnblogs.com/wupeiqi/p/6536518.html

    Modules

    requests.get(url='URL')  # fetch the page at the given URL

    BeautifulSoup

    soup = BeautifulSoup('<HTML string>', 'html.parser')

    tag = soup.find(name='div', attrs={'id': 't'})       # first matching tag

    tags = soup.find_all(name='div', attrs={'id': 't'})  # all matching tags

    tag.find('h3').text                # text inside the tag

    tag.find('h3').get('attr_name')    # a single attribute, e.g. get('href')

    tag.find('h3').attrs               # all attributes as a dict
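
    Putting the two together, a minimal end-to-end sketch (the URL and the id 't' are placeholders):

    import requests
    from bs4 import BeautifulSoup

    # Fetch a page and parse the returned HTML.
    response = requests.get(url='http://www.example.com')
    soup = BeautifulSoup(response.text, 'html.parser')

    tag = soup.find(name='div', attrs={'id': 't'})
    if tag:                    # find() returns None when nothing matches
        h3 = tag.find('h3')
        if h3:
            print(h3.text)     # text inside the first <h3>
            print(h3.attrs)    # all of its attributes as a dict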

    HTTP Request Basics

    requests
    GET:
    requests.get(url="http://www.oldboyedu.com")
    # raw request: "GET / HTTP/1.1\r\nHost: oldboyedu.com ..."

    requests.get(url="http://www.oldboyedu.com/index.html?p=1")
    # raw request: "GET /index.html?p=1 HTTP/1.1\r\nHost: oldboyedu.com ..."

    requests.get(url="http://www.oldboyedu.com/index.html", params={'p': 1})
    # raw request: "GET /index.html?p=1 HTTP/1.1\r\nHost: oldboyedu.com ..."

    POST:
    requests.post(url="http://www.oldboyedu.com", data={'name': 'alex', 'age': 18})
    # default header: Content-Type: application/x-www-form-urlencoded
    # raw request: "POST / HTTP/1.1\r\nHost: oldboyedu.com ... name=alex&age=18"

    requests.post(url="http://www.oldboyedu.com", json={'name': 'alex', 'age': 18})
    # default header: Content-Type: application/json
    # raw request: 'POST / HTTP/1.1\r\nHost: oldboyedu.com ... {"name": "alex", "age": 18}'

    requests.post(
        url="http://www.oldboyedu.com",
        params={'p': 1},
        json={'name': 'alex', 'age': 18}
    )
    # default header: Content-Type: application/json; 'p' is appended to the query string

    GET Requests

    Example with parameters:

    import requests

    payload = {'key1': 'v1', 'key2': 'v2'}

    ret = requests.get("http://test.cn/get", params=payload)

    print(ret.url)

    print(ret.text)
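
    Beyond url and text, the Response object exposes a few other commonly used fields; a quick sketch (same placeholder URL as above):

    import requests

    ret = requests.get("http://test.cn/get", params={'key1': 'v1'})
    print(ret.status_code)  # HTTP status code, e.g. 200
    print(ret.encoding)     # encoding used to decode ret.text
    print(ret.content)      # raw body as bytes
    # ret.json() parses the body as JSON (raises ValueError if it is not JSON)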

    POST Requests

    import requests

    import json

    url = 'https://api.github.com/some/endpoint'
    payload = {'some': 'data'}
    headers = {'content-type': 'application/json'}

    ret = requests.post(url, data=json.dumps(payload), headers=headers)

    print(ret.text)
    print(ret.cookies)

    Other request parameters (all accepted by requests.request):

    1. method
    2. url
    3. params
    4. data
    5. json
    6. headers
    7. cookies
    8. files
    9. auth
    10. timeout
    11. allow_redirects
    12. proxies
    13. stream
    14. cert

    requests.request(method='POST',
                     url='http://127.0.0.1:8000/test/',
                     data=open('data_file.py', mode='r', encoding='utf-8'),  # file contents: k1=v1;k2=v2;k3=v3;k3=v4
                     headers={'Content-Type': 'application/x-www-form-urlencoded'}
                     )

    def param_cookies():
        # send cookies to the server
        requests.request(method='POST',
                         url='http://127.0.0.1:8000/test/',
                         data={'k1': 'v1', 'k2': 'v2'},
                         cookies={'cook1': 'value1'},
                         )

    def param_auth():
        from requests.auth import HTTPBasicAuth, HTTPDigestAuth
    
        ret = requests.get('https://api.github.com/user', auth=HTTPBasicAuth('wupeiqi', 'sdfasdfasdf'))
        print(ret.text)
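
    The examples above cover data, headers, cookies and auth; here is a hedged sketch of a few of the remaining parameters from the list (the URL, proxy address and file name are placeholders):

    import requests

    ret = requests.request(
        method='GET',
        url='http://127.0.0.1:8000/test/',
        timeout=(5, 10),          # (connect timeout, read timeout) in seconds
        allow_redirects=True,     # follow 3xx redirects automatically
        proxies={'http': 'http://127.0.0.1:8888'},  # route via a local proxy
        stream=True,              # do not download the body up front
    )
    for chunk in ret.iter_content(chunk_size=1024):  # read the body in chunks
        pass

    # files uploads multipart/form-data:
    # requests.post('http://127.0.0.1:8000/test/', files={'f1': open('upload.txt', 'rb')})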

    The BeautifulSoup Module

    This module takes an HTML or XML string, parses it into a document tree, and provides methods for quickly locating specific elements, which makes searching HTML or XML documents straightforward.

    Installation:

    pip3 install beautifulsoup4

    Usage example:

    from bs4 import BeautifulSoup

    html_doc = """
    <html>
        <head>
            <title>The Dormouse's story</title>
        </head>
        <body>
            <div id='i1'>
                <a>sdfs</a>
            </div>
            <p class='c2'>asdfa</p>
        </body>
    </html>
    """

    Methods:

    1. name, the tag's name

    soup = BeautifulSoup(html_doc, 'html.parser')
    tag = soup.find('a')
    name = tag.name  # get
    print(name)
    tag.name = 'span'  # set
    print(soup)

    2. attrs, the tag's attributes

    soup = BeautifulSoup(html_doc, 'html.parser')
    tag = soup.find('a')
    attrs = tag.attrs  # get
    print(attrs)
    tag.attrs['id'] = 'iiii'  # set
    print(soup)
    attrs = tag.attrs  # get again, now includes the new id
    print(attrs)

    3. children, all direct children

    body = soup.find('body')
    v = body.children
    print(list(v))
    

    4. descendants, all descendants

    body = soup.find('body')
    v = body.descendants
    print(list(v))
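
    The difference is easiest to see on the sample html_doc: children yields only direct children, while descendants walks the entire subtree, e.g.:

    soup = BeautifulSoup(html_doc, 'html.parser')
    body = soup.find('body')
    print(len(list(body.children)))     # direct children only: the div, the p, and whitespace strings
    print(len(list(body.descendants)))  # every nested tag and string, including the <a> inside the div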
    

    5. clear, empty out all of the tag's children (the tag name itself is kept)

    tag = soup.find('body')
    tag.clear()
    print(soup)

    6. decompose, recursively remove the tag and everything inside it

    tag = soup.find('body')
    tag.decompose()
    print(soup)

    7. extract, recursively remove the tag and return the removed tag

    tag = soup.find('body')
    v = tag.extract()  # the removed tag is returned
    print(soup)
    print(v)

    8. decode, convert to a string (including the current tag); decode_contents (excluding the current tag)

    tag = soup.find('body')
    v = tag.decode()   # tag -> str
    v1 = tag.encode()  # tag -> bytes
    print(v, v1)
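
    decode_contents (and its byte counterpart encode_contents) does the same but leaves out the enclosing tag itself, e.g.:

    tag = soup.find('body')
    print(tag.decode())           # '<body>...</body>' - includes the tag
    print(tag.decode_contents())  # only what sits between <body> and </body>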

    9. find, get the first matching tag; find_all, get all matching tags

    soup = BeautifulSoup(html_doc,'html.parser')
    tag = soup.find('a')
    tags = soup.find_all('body')
    for i in tags:
        print(i)
    v = soup.find_all(name=['a','div'])
    print(v)
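
    Besides a name or a list of names, find_all also accepts a regular expression or a filter function, e.g.:

    import re

    soup = BeautifulSoup(html_doc, 'html.parser')
    print(soup.find_all(name=re.compile('^d')))       # tags whose name starts with 'd', e.g. div
    print(soup.find_all(attrs={'class': 'c2'}))       # filter by attribute value
    print(soup.find_all(lambda t: t.has_attr('id')))  # filter with a function over each tag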

    10. has_attr, check whether the tag has a given attribute

    soup = BeautifulSoup(html_doc,'html.parser')
    tag = soup.find('a')
    v = tag.has_attr('class')
    v1 = tag.has_attr('id')
    print(v,v1)

    11. get_text, get the text inside the tag

    soup = BeautifulSoup(html_doc,'html.parser')
    tag = soup.find('a')
    v = tag.get_text()
    print(v)
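
    get_text also takes a separator and a strip flag, which helps when a tag contains several nested text nodes:

    tag = soup.find('body')
    print(tag.get_text())                           # all text, whitespace included
    print(tag.get_text(separator='|', strip=True))  # 'sdfs|asdfa'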

    Other methods:

    index, the index position of a tag within another tag

    is_empty_element, whether the tag is an empty (void) element or a self-closing tag

    select, select_one: CSS selectors

    soup.select("body a")
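
    select returns a list of every match and select_one returns the first match (or None); the selectors are ordinary CSS, e.g.:

    soup = BeautifulSoup(html_doc, 'html.parser')
    print(soup.select('div#i1 a'))    # <a> tags inside the div with id="i1"
    print(soup.select('p.c2'))        # <p> tags carrying class "c2"
    print(soup.select_one('body a'))  # first match only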

    Tag content:

    print(tag.string)

    tag.string = 'new content'  # set

    append: append a tag inside the current tag (a short sketch follows this list)

    insert: insert a tag at a given position inside the current tag

    insert_after, insert_before: insert after or before the current tag

    replace_with: replace the current tag with the given tag

    wrap: wrap the current tag in the given tag

    unwrap: remove the current tag, keeping what it wrapped
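
    A short sketch exercising a few of these on the sample document (the tag names created here are illustrative):

    soup = BeautifulSoup(html_doc, 'html.parser')
    tag = soup.find('a')

    new_tag = soup.new_tag('b')     # build a fresh tag to insert
    new_tag.string = 'bold'
    tag.insert_after(new_tag)       # place it right after the <a>

    tag.wrap(soup.new_tag('span'))  # the <a> is now wrapped in a <span>
    tag.unwrap()                    # undo: drop the <a>, keep its text inside the span
    print(soup)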

    
    