  • Python Crawler in Practice (4): Simulated Login to Douban (Simulated Login and CAPTCHA Handling with Scrapy)

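    When crawling websites with the Scrapy framework, you will sooner or later run into sites that require a login before they serve any data.

    I have spent the last couple of days learning how to simulate a login. Combining code I wrote myself with ideas borrowed from other people's projects, I got a simulated Douban login working, and along the way I automated the CAPTCHA handling.

    CAPTCHAs are usually handled through a CAPTCHA-solving platform, although you can also recognize them with machine learning. Later on I would like to build my own machine-learning CAPTCHA recognition API, trained on the mainstream sites, so that I can call it whenever I need it. (I don't know yet whether I can pull it off; one step at a time!)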

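    Approach

    I. Key points for logging in to Douban

    1. Analyze the real POST address: find its form data. As in the screenshot below, press F12 in the browser to find it.
    2. Simulate the POST: construct equivalent form data.
    3. Handle the CAPTCHA: use a CAPTCHA-solving platform.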
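The second and third steps can be sketched as a small helper that builds the form data, adding the CAPTCHA field only when a solution is available. This is a minimal Python 3 sketch: the field names (`form_email`, `form_password`, `captcha-solution`) are the ones used in the spider later in this article, while the function name `build_login_form` and the example credentials are hypothetical.

```python
def build_login_form(email, password, captcha_solution=None):
    """Build the POST fields for the Douban login form.

    The CAPTCHA field is only included when the login page actually
    showed a CAPTCHA and a solution was obtained for it.
    """
    form = {
        "form_email": email,
        "form_password": password,
    }
    if captcha_solution is not None:
        form["captcha-solution"] = captcha_solution
    return form


# Hypothetical example values, for illustration only
plain_form = build_login_form("user@example.com", "secret")
captcha_form = build_login_form("user@example.com", "secret", "abcd")
```

Keeping the form construction in one place avoids repeating the account fields in both the "CAPTCHA" and "no CAPTCHA" branches.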

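    Hands-on

    The code below was debugged successfully on 2017-4-5.

    Target site: Douban

    Goal: simulate a Douban login and handle the CAPTCHA; reaching the personal home page counts as success.

    Data: no data is scraped. This exercise is about learning simulated login and CAPTCHA handling. If you need to scrape data, write the corresponding extraction rules and you can scrape the content.

    A successful login is shown in the figure:

    I post the main code here; for the complete code, see my GitHub: https://github.com/pujinxiao/douban_login

    The main code of DouBan.py in the spiders folder is as follows: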

    # -*- coding: utf-8 -*-
    # Python 2 / Scrapy code
    import re
    import urllib

    import scrapy
    from scrapy.http import Request, FormRequest

    import ruokuai


    class DoubanSpider(scrapy.Spider):
        name = "DouBan"
        allowed_domains = ["douban.com"]
        #start_urls = ['http://douban.com/']
        # User-Agent header sent with the simulated login request
        header = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36"}

        def start_requests(self):
            url = 'https://www.douban.com/accounts/login'
            # meta={'cookiejar': 1}: the 1 is an identifier; passing different
            # identifiers lets you keep several cookie sessions
            return [Request(url=url, meta={"cookiejar": 1}, callback=self.parse)]

        def parse(self, response):
            # URL of the CAPTCHA image (empty list if the page shows no CAPTCHA)
            captcha = response.xpath('//*[@id="captcha_image"]/@src').extract()
            print captcha
            if len(captcha) > 0:
                # A CAPTCHA is present
                # Option 1: enter the CAPTCHA by hand
                #urllib.urlretrieve(captcha[0],filename="C:/Users/pujinxiao/Desktop/learn/douban20170405/douban/douban/spiders/captcha.png")
                #captcha_value=raw_input('Open captcha.png and enter the CAPTCHA: ')

                # Option 2: the Ruokuai CAPTCHA-solving platform. The CAPTCHA is
                # a string of letters of arbitrary length, so the success rate is fairly low
                captcha_value = ruokuai.get_captcha(captcha[0])
                reg = re.compile(r'<Result>(.*?)</Result>')
                captcha_value = re.findall(reg, captcha_value)[0]
                print 'CAPTCHA:', captcha_value

                data = {
                    "form_email": "your_email@example.com",  # replace with your own account
                    "form_password": "your_password",        # replace with your own password
                    "captcha-solution": captcha_value,
                    # URL to redirect to after login; we want the personal home page
                    #"redir": "https://www.douban.com/people/151968962/",
                }
            else:
                # No CAPTCHA is present
                print 'No CAPTCHA'
                data = {
                    "form_email": "your_email@example.com",  # replace with your own account
                    "form_password": "your_password",        # replace with your own password
                    #"redir": "https://www.douban.com/people/151968962/",
                }
            print 'Logging in......'
            # Log in with FormRequest.from_response()
            return [
                FormRequest.from_response(
                    response,
                    meta={"cookiejar": response.meta["cookiejar"]},
                    headers=self.header,
                    formdata=data,
                    callback=self.get_content,
                )
            ]

        def get_content(self, response):
            title = response.xpath('//title/text()').extract()[0]
            # u'登录豆瓣' ("Log in to Douban") is the title of the login page,
            # so seeing it again means the login did not go through
            if u'登录豆瓣' in title:
                print 'Login failed, please retry!'
            else:
                print 'Login succeeded'
                # The actual crawling can continue from here
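The spider passes `meta={"cookiejar": 1}` on the first request and then `response.meta["cookiejar"]` on the login request. In Scrapy the `cookiejar` key is not sticky, so it has to be handed along on every follow-up request that should share the session. The idea of "one jar per identifier" can be illustrated with a small stand-in written in Python 3. This is a conceptual sketch only, not Scrapy's implementation; the class `SessionStore` and the cookie name `sid` are hypothetical.

```python
from collections import defaultdict


class SessionStore:
    """Conceptual stand-in: one independent cookie dict per identifier."""

    def __init__(self):
        self._jars = defaultdict(dict)

    def save_cookies(self, meta, cookies):
        # Cookies are stored under the identifier found in meta["cookiejar"]
        self._jars[meta.get("cookiejar")].update(cookies)

    def cookies_for(self, meta):
        # Return a copy so callers cannot modify the stored jar by accident
        return dict(self._jars[meta.get("cookiejar")])


store = SessionStore()

# The first response sets a (hypothetical) session cookie in jar 1
login_meta = {"cookiejar": 1}
store.save_cookies(login_meta, {"sid": "abc"})

# A follow-up request reuses the session by passing the same identifier along
follow_up_meta = {"cookiejar": login_meta["cookiejar"]}
same_session = store.cookies_for(follow_up_meta)

# A request with a different identifier starts with an empty jar
other_session = store.cookies_for({"cookiejar": 2})
```

Using a second identifier in the same way would allow two accounts to be logged in side by side within one spider.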

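    The code of ruokuai.py is as follows:

    I use the Ruokuai (若快) CAPTCHA-solving platform in its URL recognition mode: you hand the platform the URL of the CAPTCHA image, and it sends back the recognized CAPTCHA text.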

    # -*- coding: utf-8 -*-
    # Python 2 code (urllib2, raw_input, print statement)
    import sys, hashlib, os, random, urllib, urllib2
    from datetime import *

    class APIClient(object):
        def http_request(self, url, paramDict):
            # URL-encode the fields so that values containing '&' or '='
            # (for example an image URL with a query string) are sent intact
            post_content = urllib.urlencode(paramDict)
            #print post_content
            req = urllib2.Request(url, data=post_content)
            req.add_header('Content-Type', 'application/x-www-form-urlencoded')
            opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
            response = opener.open(req, post_content)
            return response.read()

        def http_upload_image(self, url, paramKeys, paramDict, filebytes):
            # Build a multipart/form-data body by hand
            timestr = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
            boundary = '------------' + hashlib.md5(timestr).hexdigest().lower()
            boundarystr = '\r\n--%s\r\n'%(boundary)

            bs = b''
            for key in paramKeys:
                bs = bs + boundarystr.encode('ascii')
                param = "Content-Disposition: form-data; name=\"%s\"\r\n\r\n%s"%(key, paramDict[key])
                #print param
                bs = bs + param.encode('utf8')
            bs = bs + boundarystr.encode('ascii')

            # The image itself goes into a part named "image"
            header = 'Content-Disposition: form-data; name="image"; filename="%s"\r\nContent-Type: image/gif\r\n\r\n'%('sample')
            bs = bs + header.encode('utf8')

            bs = bs + filebytes
            tailer = '\r\n--%s--\r\n'%(boundary)
            bs = bs + tailer.encode('ascii')

            import requests
            headers = {'Content-Type':'multipart/form-data; boundary=%s'%boundary,
                       'Connection':'Keep-Alive',
                       'Expect':'100-continue',
                       }
            response = requests.post(url, params='', data=bs, headers=headers)
            return response.text

    def arguments_to_dict(args):
        # Turn command-line arguments of the form key=value into a dict
        argDict = {}
        if args is None:
            return argDict

        count = len(args)
        if count <= 1:
            print 'exit:need arguments.'
            return argDict

        # Skip args[0] (the script name) and visit every remaining argument
        for i in range(1, count):
            pair = args[i].split('=')
            if len(pair) < 2:
                continue
            else:
                argDict[pair[0]] = pair[1]

        return argDict

    def get_captcha(image_url):
        client = APIClient()
        while 1:
            paramDict = {}
            result = ''
            # Type "url" to recognize the CAPTCHA from image_url; "help" lists all modes
            act = raw_input('Enter the recognition mode (url): ')
            if cmp(act, 'info') == 0:
                paramDict['username'] = raw_input('username:')
                paramDict['password'] = raw_input('password:')
                result = client.http_request('http://api.ruokuai.com/info.xml', paramDict)
            elif cmp(act, 'register') == 0:
                paramDict['username'] = raw_input('username:')
                paramDict['password'] = raw_input('password:')
                paramDict['email'] = raw_input('email:')
                result = client.http_request('http://api.ruokuai.com/register.xml', paramDict)
            elif cmp(act, 'recharge') == 0:
                paramDict['username'] = raw_input('username:')
                paramDict['id'] = raw_input('id:')
                paramDict['password'] = raw_input('password:')
                result = client.http_request('http://api.ruokuai.com/recharge.xml', paramDict)
            elif cmp(act, 'url') == 0:
                # Recognize the CAPTCHA from the URL of its image
                paramDict['username'] = '********'
                paramDict['password'] = '********'
                paramDict['typeid'] = '2000'
                paramDict['timeout'] = '90'
                paramDict['softid'] = '76693'
                paramDict['softkey'] = 'ec2b5b2a576840619bc885a47a025ef6'
                paramDict['imageurl'] = image_url
                result = client.http_request('http://api.ruokuai.com/create.xml', paramDict)
            elif cmp(act, 'report') == 0:
                paramDict['username'] = raw_input('username:')
                paramDict['password'] = raw_input('password:')
                paramDict['id'] = raw_input('id:')
                result = client.http_request('http://api.ruokuai.com/create.xml', paramDict)
            elif cmp(act, 'upload') == 0:
                # Recognize the CAPTCHA from a local image file
                paramDict['username'] = '********'
                paramDict['password'] = '********'
                paramDict['typeid'] = '2000'
                paramDict['timeout'] = '90'
                paramDict['softid'] = '76693'
                paramDict['softkey'] = 'ec2b5b2a576840619bc885a47a025ef6'
                paramKeys = ['username',
                     'password',
                     'typeid',
                     'timeout',
                     'softid',
                     'softkey'
                    ]

                from PIL import Image
                imagePath = raw_input('Image Path:')
                img = Image.open(imagePath)
                if img is None:
                    print 'get file error!'
                    continue
                img.save("upload.gif", format="gif")
                filebytes = open("upload.gif", "rb").read()
                result = client.http_upload_image("http://api.ruokuai.com/create.xml", paramKeys, paramDict, filebytes)

            elif cmp(act, 'help') == 0:
                print 'info'
                print 'register'
                print 'recharge'
                print 'url'
                print 'report'
                print 'upload'
                print 'help'
                print 'exit'
            elif cmp(act, 'exit') == 0:
                break

            # Returns after the first command; 'exit' leaves the loop and returns None
            return result
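The spider pulls the CAPTCHA text out of the platform's reply with `re.findall(...)[0]`, which raises an `IndexError` when the reply contains no `<Result>` element (for instance when recognition fails). A slightly more defensive extraction might look like the Python 3 sketch below. Only the `<Result>...</Result>` pattern comes from the code above; the function name `extract_captcha_text` and both sample replies are hypothetical, since the exact reply format is not shown in this article.

```python
import re

# Same pattern as in the spider; re.S lets the value span line breaks
RESULT_RE = re.compile(r"<Result>(.*?)</Result>", re.S)


def extract_captcha_text(xml_text):
    """Return the text inside <Result>...</Result>, or None if it is absent."""
    match = RESULT_RE.search(xml_text or "")
    if match is None:
        return None
    return match.group(1).strip()


# Hypothetical replies, for illustration only
ok_reply = "<Root><Result>abcd</Result></Root>"
failed_reply = "<Root><Error>recognition failed</Error></Root>"

solved = extract_captcha_text(ok_reply)
unsolved = extract_captcha_text(failed_reply)
```

With a `None` result the spider could request a fresh login page and try again instead of crashing.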

    Notes

    Key points:
    1. Using return Request
      return [Request(url=url,meta={"cookiejar":1},callback=self.parse)]   # meta={'cookiejar': 1}: the 1 is an identifier; different identifiers let you keep several cookie sessions
    2. Using a CAPTCHA-solving platform
      Simply pass the URL of the CAPTCHA image to the platform's URL interface
    3. Using FormRequest
      return [
          FormRequest.from_response(
              response,
              meta={"cookiejar":response.meta["cookiejar"]},
              headers=self.header,
              formdata=data,
              callback=self.get_content,
          )
      ]
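`FormRequest.from_response()` pre-populates the request with the fields found in the HTML form of the response (including hidden ones) and then overrides them with whatever is passed as `formdata`. That merge can be sketched with the Python 3 standard library alone. This is an illustration of the idea under stated assumptions, not Scrapy's code: the names `InputCollector` and `merge_form_fields` are hypothetical, and so is the sample HTML with its hidden `source` field.

```python
from html.parser import HTMLParser


class InputCollector(HTMLParser):
    """Collects the name/value pairs of all <input> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        name = attr.get("name")
        if name:
            # Inputs without a value attribute are treated as empty strings
            self.fields[name] = attr.get("value") or ""


def merge_form_fields(html, overrides):
    """Start from the fields already in the form, then apply the overrides."""
    collector = InputCollector()
    collector.feed(html)
    merged = dict(collector.fields)
    merged.update(overrides)
    return merged


# Hypothetical login form, for illustration only
SAMPLE_HTML = """
<form action="/accounts/login" method="post">
  <input type="hidden" name="source" value="index_nav">
  <input type="text" name="form_email">
  <input type="password" name="form_password">
</form>
"""

merged = merge_form_fields(
    SAMPLE_HTML,
    {"form_email": "user@example.com", "form_password": "secret"},
)
```

This is why the `data` dict in the spider only needs the account fields and the CAPTCHA solution: anything else the form carries is picked up from the page itself.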

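    Author: 今孝
    Source: http://www.cnblogs.com/jinxiao-pu/p/6670672.html
    The copyright of this article is shared by the author and cnblogs (博客园). Reposting is welcome, but unless the author agrees otherwise, this notice must be kept and a link to the original article must be given in a prominent position on the page.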
