1. GET requests with the requests module
Goal: scrape the Sogou homepage.
import requests

url = "https://www.sogou.com/"
response = requests.get(url=url)

# Get the page data as a string
page = response.text
with open("./sogou.html", "w", encoding="utf-8") as fp:
    fp.write(page)
Some other useful response attributes:
# Get the page data as raw bytes
print(response.content)
# Get the response status code
print(response.status_code)
# Get the response headers
print(response.headers)
# Get the URL of the request
print(response.url)
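If the server returns JSON, response.json() parses the body straight into Python objects. A minimal sketch, using httpbin.org only as an example endpoint that echoes the request back as JSON:

import requests

# httpbin.org/get returns a JSON description of the request (example endpoint)
response = requests.get("https://httpbin.org/get")
data = response.json()  # parse the JSON body into a dict
print(data["url"])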
For a GET request with query parameters, you can either append them to the URL yourself or pass them as a dictionary via the params argument, as in the following code:
import requests

url = "https://www.sogou.com/web"
# Pack the query parameters into a dictionary
params = {
    'query': "周杰伦",
    'ie': "utf-8",
}
response = requests.get(url=url, params=params)
print(response.status_code)
print(response.content)
The headers argument works the same way, letting you attach custom request headers such as User-Agent. A minimal sketch (the header value is just an example):
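import requests

url = "https://www.sogou.com/web"
params = {"query": "周杰伦", "ie": "utf-8"}
# Custom request headers; this User-Agent string is only an example value
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36",
}
response = requests.get(url=url, params=params, headers=headers)
print(response.request.headers)  # the headers that were actually sent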
2. POST requests
Log in to Douban Movies and fetch the page returned after a successful login (Douban has since changed this URL, so this is only an example).
import requests

# This URL is no longer valid; it is kept only as an example
url = "https://accounts.douban.com/login"
# Pack the POST form parameters
data = {
    "source": "movie",
    "redir": "https://movie.douban.com/",
    "form_email": "1111",      # your account email
    "form_password": "11111",  # your password
    "login": "登录",
}
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36",
}
# Send the POST request
response = requests.post(url=url, data=data, headers=headers)
print(response.status_code)
print(response.text)
with open("./douban.html", "w", encoding="utf-8") as fp:
    fp.write(response.text)
3. Ajax GET requests
Goal: scrape the details of romance films from the Douban Movies chart.
import requests

url = "https://movie.douban.com/j/chart/top_list?"
params = {
    "type": "5",
    "interval_id": "100:90",
    "action": "",
    "start": "120",
    "limit": "20",
}
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36",
}
response = requests.get(url=url, params=params, headers=headers)
print(response.text)
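Since this endpoint returns a JSON array rather than HTML, it is more convenient to parse it with response.json(). A sketch that continues from the code above; the "title" and "score" keys are assumptions based on the response format at the time of writing and may have changed:

# Sketch: parse the JSON list returned by the chart endpoint.
# "title" and "score" are assumed field names and may have changed.
movies = response.json()
for movie in movies:
    print(movie.get("title"), movie.get("score"))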
4. Ajax POST requests
Goal: scrape KFC restaurant location data by city.
import requests

url = "http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword"
params = {
    "cname": "",
    "pid": "",
    "keyword": "北京",
    "pageIndex": "1",
    "pageSize": "10",
}
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36",
}
response = requests.post(url=url, params=params, headers=headers)
print(response.text)
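The response here is also JSON. A sketch of pulling out the store list, continuing from the code above; the "Table1", "storeName", and "addressDetail" key names are assumptions based on the endpoint's historical response format and may have changed:

import json

# Sketch: the endpoint historically returned a JSON object with a
# "Table1" list of stores; these key names are assumptions.
result = json.loads(response.text)
for store in result.get("Table1", []):
    print(store.get("storeName"), store.get("addressDetail"))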
5. Putting it together
Goal: scrape multiple pages of Sogou Zhihu search results for a given keyword.
import requests
import os

# Create the output folder if it does not exist
if not os.path.exists("./pages"):
    os.mkdir("./pages")

url = "https://zhihu.sogou.com/zhihu?"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36",
}
# The keyword to search for
word = input("please enter your word:")
# The page number range to fetch
start_num = int(input("enter start page number:"))
end_num = int(input("enter end page number:"))
for page in range(start_num, end_num + 1):
    param = {
        "query": word,
        "page": page,
        "ie": "utf-8",
    }
    response = requests.get(url=url, params=param, headers=headers)
    filename = word + str(page) + ".html"
    # Persist the page data to disk
    with open("pages/%s" % filename, "w", encoding="utf-8") as fp:
        fp.write(response.text)
6. Working with cookies
Flow: 1. log in and obtain the cookie; 2. carry that cookie along when requesting the personal homepage.
Note: use a session object to send the requests; it stores cookies automatically and attaches them to every subsequent request.
import requests

session = requests.Session()

# Send the login request
login_url = "https://accounts.douban.com/passport/login"
data = {
    "source": 'None',
    "redir": "https://movie.douban.com/people/123/",
    "form_email": "123",    # your account email
    "form_password": "123", # your password
    "login": "登录",
}
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36",
}
session_response = session.post(url=login_url, data=data, headers=headers)

# The session now carries the login cookies automatically
url = 'https://movie.douban.com/people/123/'
response = session.get(url=url, headers=headers)
page = response.text
with open("./doubanlogin.html", "w", encoding="utf-8") as fp:
    fp.write(page)
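To confirm that the session really did store the cookies from the login response, you can iterate over session.cookies:

# Sketch: inspect the cookies the session stored after the login request
for cookie in session.cookies:
    print(cookie.name, cookie.value)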
Note: Douban has since changed its API, so the parameters above no longer work; the point is to understand the flow. You can also simulate a login by constructing the cookies yourself, though that is fairly tedious. A minimal sketch of passing hand-built cookies to a request (the cookie names and values below are placeholders, not real Douban cookies):
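import requests

# Sketch: pass cookies copied from a logged-in browser session.
# The names and values below are placeholders.
cookies = {
    "session_id": "your-cookie-value",
}
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36",
}
response = requests.get("https://movie.douban.com/people/123/",
                        cookies=cookies, headers=headers)
print(response.status_code)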
7. Using proxies
import requests

proxies = {
    "http": "http://10.10.1.10:3128",
    "https": "http://10.10.1.10:1080",
}
url = "https://www.taobao.com"
requests.get(url=url, proxies=proxies)
Of course, this proxy address is not real; replace it with a working proxy of your own. requests also supports SOCKS proxies, which require the PySocks dependency (install with pip install requests[socks]). A sketch follows; the host, port, and credentials are placeholders:
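import requests

# Sketch: SOCKS5 proxy support requires PySocks
# (install with: pip install requests[socks]).
# The host, port, and credentials below are placeholders.
proxies = {
    "http": "socks5://user:pass@10.10.1.10:1080",
    "https": "socks5://user:pass@10.10.1.10:1080",
}
response = requests.get("https://www.taobao.com", proxies=proxies)
print(response.status_code)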
The requests module offers far more than what is covered here; explore the official documentation if you need the details.