  • Scrapy: log in first, then crawl

    A good in-depth read:

    from scrapy.spiders.init import InitSpider
    from scrapy.http import Request, FormRequest
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import Rule
    
    class MySpider(InitSpider):
        name = 'myspider'
        allowed_domains = ['domain.com']
        login_page = 'http://www.domain.com/login'
        start_urls = ['http://www.domain.com/useful_page/',
                      'http://www.domain.com/another_useful_page/']
    
        # Follow links ending in "-<word>.html". Note that rules are applied
        # by CrawlSpider; a plain InitSpider does not process them itself.
        rules = (
            Rule(LinkExtractor(allow=r'-\w+\.html$'),
                 callback='parse_item', follow=True),
        )
    
        def init_request(self):
            """This function is called before crawling starts."""
            return Request(url=self.login_page, callback=self.login)
    
        def login(self, response):
            """Generate a login request."""
            return FormRequest.from_response(response,
                        formdata={'name': 'herman', 'password': 'password'},
                        callback=self.check_login_response)
    
        def check_login_response(self, response):
            """Check the response returned by a login request to see if we are
            successfully logged in.
            """
            if "Hi Herman" in response.text:
                self.log("Successfully logged in. Let's start crawling!")
                # Now the crawling can begin: return the deferred start
                # requests so that Scrapy actually schedules them.
                return self.initialized()
            else:
                self.log("Bad times :(")
                # Something went wrong, we couldn't log in, so nothing happens.
    
        def parse_item(self, response):
            # Scrape data from page (left as a stub in the original snippet)
            pass
    
    Note: this code snippet comes from: http://www.sharejs.com/codes/python/8544
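
    The `allow` argument in the rule above is a regular expression, and its backslashes are easy to lose when the snippet is copied around. As a minimal sketch, assuming the intended pattern is `-\w+\.html$` and that link extractors test it with `re.search` (which is my understanding of Scrapy's behavior), the following checks which URLs would be followed. The example URLs and the `followed` helper are made up for illustration.

```python
import re

# Assumed intent of the rule's `allow` pattern: a hyphen, one or more word
# characters, then a literal ".html" at the very end of the URL.
ALLOW = re.compile(r'-\w+\.html$')

# Hypothetical URLs, invented for this illustration only.
candidates = [
    'http://www.domain.com/useful_page/item-42.html',
    'http://www.domain.com/useful_page/',
    'http://www.domain.com/another_useful_page/post-abc.html?page=2',
]

def followed(urls):
    """Return the URLs that the pattern accepts when applied with re.search."""
    return [u for u in urls if ALLOW.search(u)]

matches = followed(candidates)
for url in matches:
    print(url)
```

    Only the first URL is accepted: the second has no `-<word>.html` suffix, and the third ends with a query string, so the `$` anchor fails.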


    Using a custom header (Python 2, urllib2):
    import urllib2
    request_headers = { 'User-Agent': 'PeekABoo/1.3.7' }
    request = urllib2.Request('http://sebsauvage.net', None, request_headers)
    urlfile = urllib2.urlopen(request)
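
    On Python 3, `urllib2` no longer exists; its functionality lives in `urllib.request`. A minimal sketch of the same request under that assumption is shown here. The `build_request` helper is a hypothetical name introduced for illustration, and the actual network call is left commented out so the snippet can be run offline.

```python
import urllib.request

def build_request(url, user_agent):
    """Build a Request carrying a custom User-Agent; nothing is sent yet."""
    request_headers = {'User-Agent': user_agent}
    return urllib.request.Request(url, None, request_headers)

request = build_request('http://sebsauvage.net', 'PeekABoo/1.3.7')

# Request stores header names in capitalized form, i.e. 'User-agent'.
print(request.get_header('User-agent'))

# To actually fetch the page (requires network access), uncomment:
# with urllib.request.urlopen(request) as urlfile:
#     body = urlfile.read()
```

    Inspecting the request object before sending it is a cheap way to confirm that the header is set as expected.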
  • Original article: https://www.cnblogs.com/jkmiao/p/5012969.html