  • Web crawler

    1. Code

    import re
    
    import scrapy
    
    from Fang.items import esf_FangItem
    
    
    class ExampleSpider(scrapy.Spider):
        name = 'example'
        allowed_domains = ['www.fang.com']
        start_urls = ['https://www.fang.com/SoufunFamily.htm']
    
        def parse(self, response):
            trs = response.xpath('//div[@id="c02"]//tr')
            province = None
            for tr in trs:
                province_f = tr.xpath('./td[2]//text()').get(default='')
                # Remove all whitespace; an empty result means the row continues the previous province
                province_f = re.sub(r"\s", "", province_f)
                if province_f:
                    province = province_f
                cities = tr.xpath('./td[3]/a')
                for i in cities:
                    city = i.xpath('./text()').get()
                    city_url = i.xpath('./@href').get()
                    # print(city, city_url)
                    yield scrapy.Request(url=city_url, callback=self.parse_url, meta={'info': (province, city)})
                #     break
                # break
    
        def parse_url(self, response):
            print(2)

    2. Problem description: when the project runs, parse_url is never executed, i.e. 2 is never printed.

      The log shows the following:

    2020-03-20 17:08:02 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'bj.fang.com': <GET http://bj.fang.com/>
    2020-03-20 17:08:02 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'sh.fang.com': <GET http://sh.fang.com/>
    2020-03-20 17:08:02 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'tj.fang.com': <GET http://tj.fang.com/>
    2020-03-20 17:08:02 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'cq.fang.com': <GET http://cq.fang.com/>

    3. Solution

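      A search showed that the requests for the second-level pages were being filtered out by domain. allowed_domains = ['www.fang.com'] only permits www.fang.com and its subdomains, and hosts such as bj.fang.com are not subdomains of www.fang.com, so the offsite middleware drops them before parse_url is ever called.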

      Fixes:

       

    Method 1:
      Remove the line: allowed_domains = ['www.fang.com']
      or change it to:  allowed_domains = ['fang.com']
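
To see why Method 1 works, here is a minimal sketch of the kind of host check an offsite filter performs. It is a simplified approximation written with the standard library only, not Scrapy's actual implementation; the function name is_onsite is made up for illustration.

```python
from urllib.parse import urlparse


def is_onsite(url, allowed_domains):
    """Approximate an offsite check (hypothetical helper, not Scrapy's code).

    A host passes if it equals an allowed domain or is a subdomain of one.
    An empty allowed_domains list means nothing is filtered.
    """
    if not allowed_domains:
        return True
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


# bj.fang.com is not www.fang.com and not a subdomain of it
print(is_onsite("http://bj.fang.com/", ["www.fang.com"]))
# bj.fang.com is a subdomain of fang.com
print(is_onsite("http://bj.fang.com/", ["fang.com"]))
# no allowed_domains at all
print(is_onsite("http://bj.fang.com/", []))
```

Under this model the city URLs fail the check against 'www.fang.com' but pass against 'fang.com', which matches the "Filtered offsite request" lines in the log.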

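    Method 2:
      Add dont_filter=True to the request (not recommended, because it also turns off duplicate-request filtering for that request):
      yield scrapy.Request(url=city_url, callback=self.parse_url, meta={'info': (province, city)}, dont_filter=True)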
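
The side effect of Method 2 can be illustrated with a toy model of a scheduler that keeps a set of already-seen URLs. This is only a sketch under that assumption; ToyScheduler and its enqueue method are hypothetical names and not part of Scrapy.

```python
class ToyScheduler:
    """Toy model of a scheduler with a duplicate filter (not Scrapy's real class)."""

    def __init__(self):
        self.seen = set()   # URLs that have already been scheduled
        self.queue = []     # URLs waiting to be downloaded

    def enqueue(self, url, dont_filter=False):
        # A request is dropped only when filtering is on and the URL was seen before.
        if not dont_filter and url in self.seen:
            return False
        self.seen.add(url)
        self.queue.append(url)
        return True


scheduler = ToyScheduler()
scheduler.enqueue("http://bj.fang.com/")
# Same URL again with filtering on: dropped
print(scheduler.enqueue("http://bj.fang.com/"))
# Same URL again with dont_filter=True: scheduled a second time
print(scheduler.enqueue("http://bj.fang.com/", dont_filter=True))
```

If the index page links to the same city more than once, requests marked dont_filter=True would each be downloaded, which is why narrowing allowed_domains is usually the cleaner fix.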
     
     
  • Original article: https://www.cnblogs.com/JackShi/p/12532987.html