I'll use logging in to GitHub as the example site:
https://github.com/login
By inspecting the login request (Chrome's or Firefox's developer tools both show exactly what the form submits), we can see that the POST must carry a parameter named authenticity_token.
We then notice that this authenticity_token changes on every visit, so we need a session: by keeping our multiple requests inside one session,
we get an authenticity_token that matches the one the server expects.
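Before the full spiders below, here is the token extraction in isolation. The HTML snippet is a hypothetical stand-in for GitHub's login page (the real markup may differ), shaped only so that the article's XPath applies to it:

```python
from lxml import etree

# Hypothetical minimal version of the login page: the hidden
# authenticity_token is the form's second <input>, which is what
# the XPath //*[@id="login"]/form/input[2]/@value selects.
html = '''
<div id="login">
  <form action="/session" method="post">
    <input type="hidden" name="utf8" value="&#x2713;">
    <input type="hidden" name="authenticity_token" value="abc123==">
    <input type="text" name="login">
  </form>
</div>
'''
tree = etree.HTML(html)
# xpath() returns a list of matching attribute values; take the first
token = tree.xpath('//*[@id="login"]/form/input[2]/@value')[0]
print(token)  # abc123==
```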
A quick side note on how to debug Scrapy: https://zhuanlan.zhihu.com/p/25200262
I use the second method from that post: create a run.py and debug through it.
logger: Scrapy provides a built-in logger on every spider instance; see the official docs for details: https://doc.scrapy.org/en/latest/topics/logging.html
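The post linked above doesn't reproduce the run.py itself, so here is a minimal sketch of that approach, assuming the spider name 'github' used in the code below (it needs a Scrapy project with a scrapy.cfg around it to actually run):

```python
# run.py -- put this next to scrapy.cfg so an IDE debugger can
# launch the crawl as an ordinary Python script and hit breakpoints
from scrapy.cmdline import execute

if __name__ == '__main__':
    # equivalent to running `scrapy crawl github` on the command line
    execute(['scrapy', 'crawl', 'github'])
```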
In short: in Scrapy you get a session with meta = {'cookiejar': i} — as long as the cookiejar value is the same, the requests share one session.
With Requests you create one with s = requests.Session() (s is then a session object, and you
simply do everything through it, e.g. s.get(), s.post()).
First the Scrapy code (this is the spider — you can surely tell):
Chinese documentation on submitting forms with Scrapy: https://www.rddoc.com/doc/Scrapy/1.3/zh/topics/request-response/
Official documentation: https://doc.scrapy.org/en/latest/topics/request-response.html
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request, FormRequest


class GithubSpider(scrapy.Spider):
    name = 'github'
    # allowed_domains = ['github.com']
    # start_urls = ['https://github.com/login']
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
        'Accept-Encoding': 'gzip, deflate, br',
        'Referer': 'https://github.com/login',
        'Content-Type': 'application/x-www-form-urlencoded',
    }

    def start_requests(self):
        urls = ['https://github.com/login']
        for i, url in enumerate(urls, 1):
            # requests carrying the same 'cookiejar' value share one session
            yield Request(url, meta={'cookiejar': i}, callback=self.github_parse)

    def github_parse(self, response):
        # grab the hidden CSRF token from the login form
        authenticity_token = response.xpath(
            '//*[@id="login"]/form/input[2]/@value').extract_first()
        self.logger.info('authenticity_token=' + authenticity_token)
        return FormRequest.from_response(
            response,
            url='https://github.com/session',
            meta={'cookiejar': response.meta['cookiejar']},
            headers=self.headers,
            formdata={
                'login': 'email',        # your email
                'password': 'password',  # your password
                'authenticity_token': authenticity_token,
                'utf8': '✓',
            },
            callback=self.github_login,
            # dont_click=True,
        )

    def github_login(self, response):
        data = response.xpath('//*[@id="dashboard"]/div[1]/div[2]/h3/text()').extract_first()
        if data:
            self.logger.info('Logged in successfully!')
            self.logger.info(data)
        else:
            self.logger.error('Login failed!')
Next, the version using a requests Session (this one is written rather loosely — at the time I just wanted to check whether my understanding of requests sessions was right).
If it's unfamiliar, see the docs: http://docs.python-requests.org/zh_CN/latest/user/advanced.html#advanced
import requests
# from bs4 import BeautifulSoup
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
    'Accept-Encoding': 'gzip, deflate, br',
    'Referer': 'https://github.com/login',
    'Content-Type': 'application/x-www-form-urlencoded',
}

session = requests.Session()
url1 = 'https://github.com/login'
r1 = session.get(url1)
s = etree.HTML(r1.text)
# xpath() returns a list; take the first match so we post a string,
# not a list, as the token
authenticity_token = s.xpath('//*[@id="login"]/form/input[2]/@value')[0]

url2 = 'https://github.com/session'
formdata = {
    'login': 'email',        # enter your email
    'password': 'password',  # enter your password
    'authenticity_token': authenticity_token,
    'utf8': '✓',
}
r2 = session.post(url2, data=formdata, headers=headers)
s2 = etree.HTML(r2.text)
data = s2.xpath('//*[@id="dashboard"]/div[1]/div[2]/h3/text()')
if data:
    print('Success')
else:
    print('Fail')
To repeat the takeaway:
in Scrapy, use meta = {'cookiejar': i} — requests with the same cookiejar value share one session;
with Requests, use s = requests.Session(), then s.get(), s.post().
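You can see the cookie-sharing behaviour of a requests Session offline, without touching GitHub at all. The cookie name below is made up for illustration; prepare_request shows what the Session would actually put on the wire:

```python
import requests

s = requests.Session()
# pretend the server set this cookie on an earlier response in this Session
s.cookies.set('logged_in', 'yes', domain='github.com')

# prepare_request merges the Session's cookie jar into the outgoing
# request, so every later request carries the cookie automatically
prepared = s.prepare_request(requests.Request('GET', 'https://github.com/'))
print(prepared.headers.get('Cookie'))  # logged_in=yes
```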