  • Customizing the URLs a Scrapy Spider Requests

    Previously, when scraping data with scrapy, the default pattern was to decide inside the parse logic whether to issue the next request:

    def parse(self, response):
        # Collect all the URLs to follow, e.g. into urls
        for url in urls:
            yield Request(url)

    For example:

    def parse(self, response):
        selector = Selector(response)
        movies = selector.xpath('//div[@class="info"]')
        for eachMovie in movies:
            item = MovieItem()
            item['title'] = eachMovie.xpath('div[@class="hd"]/a/span/text()').extract()
            item['star'] = eachMovie.xpath('div[@class="bd"]/div[@class="star"]/span/em/text()').extract()[0]
            item['quote'] = eachMovie.xpath('div[@class="bd"]/p[@class="quote"]/span/text()').extract()
            yield item
        # Next page: extract the link once, outside the loop
        nextLink = selector.xpath('//span[@class="next"]/link/@href').extract()
        if nextLink:
            yield Request(self.url + nextLink[0], callback=self.parse)
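
    As a side note (not in the original code), Scrapy 1.0+ can resolve the relative next-page link with response.urljoin() instead of concatenating self.url by hand; a sketch of just the pagination step, assuming the same page structure:

        # Resolve the relative "next" link against the current page's URL
        nextLink = response.xpath('//span[@class="next"]/link/@href').extract_first()
        if nextLink:
            yield Request(response.urljoin(nextLink), callback=self.parse)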

    Today I happened to be browsing the scrapy official docs and saw that the start_requests() method can be used to generate the URLs to crawl in a loop:

    def start_requests(self):
        # Build one Request per page and return them all at once
        urls = []
        for i in range(1, 10):
            url = 'http://www.test.com/?page=%s' % i
            page = scrapy.Request(url)
            urls.append(page)
        return urls
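
    Since this method is called only once (see the docs quoted further down), it is also safe to implement it as a generator; a minimal sketch of the same loop, yielding each request lazily instead of building the list up front:

        def start_requests(self):
            # Same URLs as above, yielded one at a time
            for i in range(1, 10):
                yield scrapy.Request('http://www.test.com/?page=%s' % i)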

    Python code should be simple and direct, so I replaced my earlier code with the following:

        # Start URLs (unused here; see the note below)
        start_urls = [
            "http://q.stock.sohu.com"
        ]

        # Define the URLs to crawl
        def start_requests(self):
            # Daily quotes: one request per stock code
            url_tmpl = ("http://q.stock.sohu.com/hisHq?code=cn_{0}"
                        "&start=" + self.begin_date + "&end=" + self.end_date +
                        "&stat=1&order=D&period=d&rt=json"
                        "&r=0.6618998353094041&0.8423532517054869")
            return [Request(url_tmpl.format(x['code'])) for x in self.stock_basics]

    Note: once you override start_requests(), you no longer need to set start_urls; even if you do set it, it has no effect. From the Scrapy docs:

    This method must return an iterable with the first Requests to crawl for this spider.
    This is the method called by Scrapy when the spider is opened for scraping when no particular URLs are specified. If particular URLs are specified, the make_requests_from_url() is used instead to create the Requests. This method is also called only once from Scrapy, so it’s safe to implement it as a generator.
    The default implementation uses make_requests_from_url() to generate Requests for each url in start_urls.
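
    To make the override behavior concrete, here is a minimal self-contained sketch (the spider name and URLs are hypothetical, reusing the test.com pattern from above); start_urls is present but never fetched because start_requests() takes precedence:

        import scrapy

        class PageSpider(scrapy.Spider):
            name = 'page_spider'
            # Ignored: start_requests() below takes precedence
            start_urls = ['http://www.test.com/']

            def start_requests(self):
                for i in range(1, 10):
                    yield scrapy.Request('http://www.test.com/?page=%s' % i,
                                         callback=self.parse)

            def parse(self, response):
                # Log the URL that was actually crawled
                self.log('Crawled %s' % response.url)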

    REFER:
    - http://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.spiders.Spider.start_requests
    - python爬虫----(scrapy框架提高(1),自定义Request爬取) (Python crawler: improving the scrapy framework (1), crawling with custom Requests): https://my.oschina.net/lpe234/blog/342741

  • Original article: https://www.cnblogs.com/Irving/p/6217263.html