  • CrawlSpider example: crawling Chouti (+ distributed)

    Create the project: scrapy startproject choutiPro
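
    For reference, startproject generates the standard Scrapy project skeleton:

    choutiPro/
        scrapy.cfg            # deploy configuration
        choutiPro/            # project package
            __init__.py
            items.py          # item definitions
            middlewares.py    # spider/downloader middlewares
            pipelines.py      # item pipelines
            settings.py       # project settings (configured below)
            spiders/          # spider modules go here
                __init__.py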

    Create the spider file (the -t crawl flag uses the CrawlSpider template): scrapy genspider -t crawl chouti www.xxx.com

    Open the project in PyCharm and configure the settings.py file.

    Set the user agent and the robots.txt option:

    ROBOTSTXT_OBEY = False
    USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
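
    Optionally, the standard LOG_LEVEL setting can be added as well (not required for this crawl) to keep the console readable when the spider only prints responses:

    LOG_LEVEL = 'ERROR'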

    Spider code: crawl the data from all 120 pages of the Chouti section
     

    # -*- coding: utf-8 -*-
    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule


    class ChoutiSpider(CrawlSpider):
        name = 'chouti'
        # allowed_domains = ['www.xxx.com']
        start_urls = ['https://dig.chouti.com/r/scoff/hot/1']

        # Link extractor: pulls the links that match the rule out of the page
        # source of the start URL (and of every page crawled after it).
        # The allow parameter takes a regular expression.
        link = LinkExtractor(allow=r'/r/scoff/hot/\d+')

        rules = (
            # Instantiate a rule parser object.
            # With follow=True the extractor is applied to every page it reaches,
            # so eventually every page-number link is extracted; the scheduler
            # deduplicates the requests automatically.
            Rule(link, callback='parse_item', follow=True),
        )

        def parse_item(self, response):
            # Print each page response; the data in each page can then be
            # parsed in detail here.
            print(response)
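
    Run the crawl from the project root with scrapy crawl chouti; parse_item then prints the response for each of the 120 pages.

    To extract actual data instead of just printing, parse_item can select fields from each page. A minimal sketch; the XPath expressions below are hypothetical and must be checked against Chouti's real markup:

    def parse_item(self, response):
        # hypothetical selectors -- verify against the live page structure
        for div in response.xpath('//div[@class="news-content"]'):
            title = div.xpath('.//a[@class="show-content"]/text()').extract_first()
            if title:
                print(response.url, title.strip())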
