  • Scrapy in Practice: Simulated Login with the Scrapy Framework (Including Automatic Captcha Recognition via an Aliyun Service)

    (1) Introduction

    The idea: we write code that simulates a login request to the site, i.e. submits a form containing the login credentials (username, password, and so on).

    The implementation: to send a POST request while fetching data, we use FormRequest, a subclass of Request. If we want the very first request the spider sends to be a POST, we override the start_requests() method instead of relying on start_urls (which issues GET requests).
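    A minimal sketch of that pattern, with a placeholder endpoint and placeholder form fields (the real examples follow below):

import scrapy

class LoginDemoSpider(scrapy.Spider):
    name = 'login_demo'

    def start_requests(self):
        # Overriding start_requests() makes the very first request a POST,
        # instead of the default GET requests built from start_urls.
        yield scrapy.FormRequest(
            url='https://example.com/login',                     # placeholder login endpoint
            formdata={'username': 'user', 'password': 'pass'},   # placeholder form fields
            callback=self.after_login,
        )

    def after_login(self, response):
        self.logger.info('Landed on %s after login', response.url)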

    (2) Simulated Login to Renren (Example 1)

    1. Create the project:

    scrapy startproject renren
    cd renren
    scrapy genspider spider renren.com

    2. Edit settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for renren project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'renren'

SPIDER_MODULES = ['renren.spiders']
NEWSPIDER_MODULE = 'renren.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'renren (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 1
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.79 Safari/537.36',
}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'renren.middlewares.RenrenSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'renren.middlewares.RenrenDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'renren.pipelines.RenrenPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

    3. Edit spider.py

# -*- coding: utf-8 -*-
import scrapy


class SpiderSpider(scrapy.Spider):
    name = 'spider'
    allowed_domains = ['renren.com']
    start_urls = ['http://renren.com/']

    def start_requests(self):
        url = 'http://www.renren.com/PLogin.do'
        data = {
            'email': '827832075@qq.com',
            'password': '56571218lu',
        }  # build the form data
        request = scrapy.FormRequest(url, formdata=data, callback=self.parse_page)
        yield request

    def parse_page(self, response):
        url2 = 'http://www.renren.com/880792860/profile'
        request = scrapy.Request(url2, callback=self.parse_profile)
        yield request

    def parse_profile(self, response):
        with open('baobeier.html', 'w', encoding='utf-8') as f:  # write the page to a file
            f.write(response.text)
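    A note on why the second request works: Scrapy's cookie middleware is enabled by default (see the commented-out COOKIES_ENABLED line in settings.py above), so the session cookie received from the login POST is automatically attached to the follow-up request for the profile page.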

    4. Run the spider

#author: "xian"
#date: 2018/6/13
from scrapy import cmdline
cmdline.execute('scrapy crawl spider'.split())
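    This runner script is equivalent to typing scrapy crawl spider in a terminal at the project root; it simply makes the spider easy to launch and debug from an IDE.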

    5. Result (we successfully logged in and crawled Bao Bei'er's Renren profile page)

    (3) Automatic Captcha Recognition with the Aliyun Captcha Service (service page: https://market.aliyun.com/products/57126001/cmapi014396.html#sku=yuncode=839600006)

    Testing the service: we again use the captcha from the Douban login page as the test case:

#author: "xian"
#date: 2018/6/13
from urllib import request
from base64 import b64encode
import requests

captcha_url = 'https://www.douban.com/misc/captcha?id=oL8chJoRiCTIikzwtEECZNGH:en&size=s'

request.urlretrieve(captcha_url, 'captcha.png')  # download the captcha image

recognize_url = 'http://jisuyzmsb.market.alicloudapi.com/captcha/recognize?type=e'

formdata = {}
with open('captcha.png', 'rb') as f:
    data = f.read()
    pic = b64encode(data)  # the API expects the image base64-encoded
    formdata['pic'] = pic

appcode = '614a1376aa4340b7a159d551d4eb0179'
headers = {
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'Authorization': 'APPCODE ' + appcode,
}

response = requests.post(recognize_url, data=formdata, headers=headers)
print(response.json())  # the service responds with JSON
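    A note on the response shape: the exact JSON schema is defined on the Aliyun product page, but judging from how the spider in section (4) consumes it, the recognized text sits under result['result']['code']. Treat the field path as an assumption confirmed only by that code:

result = response.json()
captcha_text = result['result']['code']  # recognized captcha string; field path taken from the spider in section (4)
print(captcha_text)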

    Run result: the captcha was recognized automatically via the Aliyun platform.

    (4) Simulated Login to Douban with Aliyun-based Captcha Recognition

    1. Create the project (a sketch of the resulting directory tree follows):

    scrapy startproject douban
    cd douban
    scrapy genspider spider douban.com
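    The commands above generate the standard Scrapy layout:

douban/
├── scrapy.cfg
└── douban/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── spider.py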

    Edit settings.py, mirroring the renren project (a sketch of the relevant lines follows this list):

    - do not obey the robots protocol
    - set the default request headers
    - set a delay between requests
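    Concretely, assuming the same choices as the renren settings.py shown earlier, the three changes amount to:

ROBOTSTXT_OBEY = False  # ignore robots.txt

DOWNLOAD_DELAY = 1  # wait one second between requests to the same site

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.79 Safari/537.36',
}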

    Edit spider.py

# -*- coding: utf-8 -*-
import scrapy
from urllib import request
from PIL import Image  # image library, used by the manual-recognition variant below
from base64 import b64encode  # base64 encoding for the captcha image
import requests


class SpiderSpider(scrapy.Spider):
    name = 'spider'
    allowed_domains = ['douban.com']
    start_urls = ['https://accounts.douban.com/login']  # start URL
    login_url = 'https://accounts.douban.com/login'  # login page URL
    profile_url = 'https://www.douban.com/people/179834288/'  # profile page URL
    editsignature_url = 'https://www.douban.com/j/people/179834288/edit_signature'  # signature-editing endpoint

    def parse(self, response):
        formdata = {
            'source': 'None',
            'redir': 'https://www.douban.com',
            'form_email': '827832075@qq.com',
            'form_password': '56571218lu',
            'remember': 'on',
            'login': '登录',  # the submit-button value; must stay the literal Chinese text
        }  # the fixed part of the form data
        captcha_url = response.css('img#captcha_image::attr(src)').get()  # extract the captcha image URL
        if captcha_url:  # is there a captcha on the page?
            captcha = self.regonize_captcha(captcha_url)  # recognize the captcha
            formdata['captcha-solution'] = captcha  # fill in the captcha-solution form field
            captcha_id = response.xpath('//input[@name = "captcha-id"]/@value').get()  # extract the captcha-id form field
            formdata['captcha-id'] = captcha_id
        yield scrapy.FormRequest(url=self.login_url, formdata=formdata, callback=self.parse_after_login)  # submit the form

    def parse_after_login(self, response):  # handle the post-login page
        if response.url == 'https://www.douban.com':  # did the login succeed?
            yield scrapy.Request(self.profile_url, callback=self.parse_profile)  # on success, request the profile page
            print('Login succeeded!')
        else:
            print('Login failed!')

    def parse_profile(self, response):  # handle the profile page
        print(response.url)
        if response.url == self.profile_url:  # did we reach the profile page?
            ck = response.xpath('//input[@name = "ck"]/@value').get()  # extract the ck token
            formdata = {
                'ck': ck,
                'signature': '积土成山,风雨兴焉!',
            }  # build the form data
            # Submit the form. The callback must be given explicitly here: without it
            # the response falls back to parse() and produces a spurious login-failure message.
            yield scrapy.FormRequest(self.editsignature_url, formdata=formdata, callback=self.parse_None)
        else:
            print('Failed to reach the profile page!')

    def parse_None(self, response):
        pass

    # Manual captcha recognition variant (kept for reference):
    # def regonize_captcha(self, image_url):
    #     request.urlretrieve(image_url, 'captcha.png')
    #     image = Image.open('captcha.png')
    #     image.show()
    #     captcha = input('Please type the captcha: ')
    #     return captcha

    def regonize_captcha(self, image_url):  # recognize the captcha with the Aliyun service from section (3); see the usage manual on the product page
        captcha_url = image_url

        request.urlretrieve(captcha_url, 'captcha.png')

        recognize_url = 'http://jisuyzmsb.market.alicloudapi.com/captcha/recognize?type=e'

        formdata = {}
        with open('captcha.png', 'rb') as f:
            data = f.read()
            pic = b64encode(data)
            formdata['pic'] = pic

        appcode = '614a1376aa4340b7a159d551d4eb0179'
        headers = {
            'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
            'Authorization': 'APPCODE ' + appcode,
        }

        response = requests.post(recognize_url, data=formdata, headers=headers)
        result = response.json()
        code = result['result']['code']
        return code
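    Note that start_urls is left in place here: Scrapy first issues a plain GET to the login page and hands the response to parse(), the default callback, which is where the form is assembled and submitted. Everything after that point is driven by explicit callbacks.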

    Finally, run the project:

    Create a main.py for convenient debugging:

#author: "xian"
#date: 2018/6/13
from scrapy import cmdline
cmdline.execute('scrapy crawl spider'.split())

    Run result (excerpt):

    We can see that the signature on my profile page has been changed to the one we set, so the program ran successfully!

    (5) Summary

    1. To send a POST request in Scrapy, the recommended approach is scrapy.FormRequest with the form data passed via formdata.

    2. To send a POST request at the very start of the crawl, override the start_requests() method.
