  • Web crawler

    1. Code

    import re
    
    import scrapy
    
    from Fang.items import esf_FangItem
    
    
    class ExampleSpider(scrapy.Spider):
        name = 'example'
        allowed_domains = ['www.fang.com']
        start_urls = ['https://www.fang.com/SoufunFamily.htm']
    
        def parse(self, response):
            trs = response.xpath('//div[@id="c02"]//tr')
            province = None
            for tr in trs:
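                # Rows that continue the previous province have a blank cell here
                province_f = tr.xpath('./td[2]//text()').get(default='')
                # Strip all whitespace so a blank cell becomes an empty string
                province_f = re.sub(r"\s", "", province_f)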
                if province_f:
                    province = province_f
                cities = tr.xpath('./td[3]/a')
                for i in cities:
                    city = i.xpath('./text()').get()
                    city_url = i.xpath('./@href').get()
                    # print(city, city_url)
                    yield scrapy.Request(url=city_url, callback=self.parse_url, meta={'info': (province, city)})
                #     break
                # break
    
        def parse_url(self, response):
            print(2)

    2. Problem description: when the project runs, parse_url is never executed, so 2 is never printed

      The log output is as follows:

    2020-03-20 17:08:02 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'bj.fang.com': <GET http://bj.fang.com/>
    2020-03-20 17:08:02 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'sh.fang.com': <GET http://sh.fang.com/>
    2020-03-20 17:08:02 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'tj.fang.com': <GET http://tj.fang.com/>
    2020-03-20 17:08:02 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'cq.fang.com': <GET http://cq.fang.com/>

    3. Solution

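      A Baidu search showed that the second-level requests (the ones yielded from parse) were dropped because their domains were filtered out: bj.fang.com and the other city hosts are neither www.fang.com nor subdomains of it, so the offsite middleware discards them before parse_url can run.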

      Fixes:

       

    Method 1:
      Remove the line: allowed_domains = ['www.fang.com']
      or change it to:  allowed_domains = ['fang.com']
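
      Why Method 1 works can be sketched with a small standalone check. This is only a simplified approximation of the subdomain rule that the offsite filter applies, not Scrapy's actual code; is_onsite is a hypothetical helper written for illustration.

```python
from urllib.parse import urlparse


def is_onsite(url, allowed_domains):
    """Hypothetical helper: True if the URL's host equals an allowed
    domain or is a subdomain of one (simplified offsite rule)."""
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain or host.endswith("." + domain)
        for domain in allowed_domains
    )


city_url = "http://bj.fang.com/"
print(is_onsite(city_url, ["www.fang.com"]))  # False: bj.fang.com is not under www.fang.com
print(is_onsite(city_url, ["fang.com"]))      # True: bj.fang.com is a subdomain of fang.com
```

      With ['fang.com'] every city host (bj, sh, tj, cq, ...) passes the check, while the crawl stays restricted to fang.com.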

    Method 2:
      Add dont_filter=True to the request (not recommended, because it also turns off duplicate filtering for that request):
      yield scrapy.Request(url=city_url, callback=self.parse_url, meta={'info': (province, city)}, dont_filter=True)
     
     
  • Original post: https://www.cnblogs.com/JackShi/p/12532987.html