Python crawler project: distributed crawling of Fang.com rental listings with scrapy-redis

Python Scrapy crawler project (Part 2)

Crawl target: Fang.com nationwide rental listings (start URL: http://zu.fang.com/cities.aspx)

Fields scraped: city, listing title, rental type, price, house layout, area, address, transportation

Anti-anti-crawling measures: a random user-agent and a delay between requests (configured in settings.py; see the sketch after step 5).

1. Create the project:

    scrapy startproject fang
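The command generates the standard Scrapy project skeleton, roughly the layout below. Note that the spider code later in this post imports from homepro.items, so substitute whatever project name you actually used:

    fang/
    ├── scrapy.cfg          # deploy configuration
    └── fang/
        ├── __init__.py
        ├── items.py        # item definitions (step 3)
        ├── middlewares.py
        ├── pipelines.py
        ├── settings.py     # project settings (step 5)
        └── spiders/
            └── __init__.py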

2. Enter the fang directory and run the command that generates the spider file, then write the spider:

    scrapy genspider zufang "zu.fang.com"

Once the command finishes, open the project directory in PyCharm.

3. Edit items.py in the project directory and define the fields you want to scrape.

    import scrapy


    class HomeproItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()

        city = scrapy.Field()       # city
        title = scrapy.Field()      # listing title
        rentway = scrapy.Field()    # rental type
        price = scrapy.Field()      # price
        housetype = scrapy.Field()  # house layout
        area = scrapy.Field()       # floor area
        address = scrapy.Field()    # address
        traffic = scrapy.Field()    # transportation

4. Enter the spiders directory, open the generated spider file, and write the crawler.

    # -*- coding: utf-8 -*-
    import scrapy
    from homepro.items import HomeproItem
    from scrapy_redis.spiders import RedisCrawlSpider


    # switched from scrapy.Spider to RedisCrawlSpider for distributed crawling
    class HomeSpider(RedisCrawlSpider):
        name = 'home'
        allowed_domains = ['zu.fang.com']
        # start_urls = ['http://zu.fang.com/cities.aspx']

        # with scrapy-redis, start URLs are read from this Redis list instead of start_urls
        redis_key = 'homespider:start_urls'

        def parse(self, response):
            # city index page: collect the link to every city's rental section
            hrefs = response.xpath('//div[@class="onCont"]/ul/li/a/@href').extract()
            for href in hrefs:
                href = 'http:' + href
                yield scrapy.Request(url=href, callback=self.parse_city, dont_filter=True)

        def parse_city(self, response):
            # total number of result pages, e.g. "共5页" -> "5"
            page_num = response.xpath('//div[@id="rentid_D10_01"]/span[@class="txt"]/text()').extract()[0].strip('共页')

            for page in range(1, int(page_num) + 1):
                if page == 1:
                    url = response.url
                else:
                    # pagination pattern used by the zu.fang.com list pages
                    url = response.url + 'house/i%d' % (page + 30)
                yield scrapy.Request(url=url, callback=self.parse_houseinfo, dont_filter=True)

        def parse_houseinfo(self, response):
            # each <dd class="info rel"> block is one rental listing
            divs = response.xpath('//dd[@class="info rel"]')
            for info in divs:
                city = info.xpath('//div[@class="guide rel"]/a[2]/text()').extract()[0].rstrip("租房")
                title = info.xpath('.//p[@class="title"]/a/text()').extract()[0]
                rentway = info.xpath('.//p[@class="font15 mt12 bold"]/text()')[0].extract().replace(" ", '').strip()
                housetype = info.xpath('.//p[@class="font15 mt12 bold"]/text()')[1].extract().replace(" ", '')
                area = info.xpath('.//p[@class="font15 mt12 bold"]/text()')[2].extract().replace(" ", '')
                addresses = info.xpath('.//p[@class ="gray6 mt12"]//span/text()').extract()
                address = '-'.join(i for i in addresses)
                try:
                    des = info.xpath('.//p[@class ="mt12"]//span/text()').extract()
                    traffic = '-'.join(i for i in des)
                except Exception:
                    traffic = "暂无详细信息"  # no transit details available

                p_name = info.xpath('.//div[@class ="moreInfo"]/p/text()').extract()[0]
                p_price = info.xpath('.//div[@class ="moreInfo"]/p/span/text()').extract()[0]
                price = p_price + p_name

                item = HomeproItem()
                item['city'] = city
                item['title'] = title
                item['rentway'] = rentway
                item['price'] = price
                item['housetype'] = housetype
                item['area'] = area
                item['address'] = address
                item['traffic'] = traffic
                yield item

5. Edit settings.py to configure how Scrapy and scrapy-redis run.

    # Use the scrapy-redis scheduler
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"

    # Use the scrapy-redis duplicate filter
    DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'

    # Queue class used to order the requests.
    # Default: priority queue (Scrapy's default), backed by a Redis sorted set (neither FIFO nor LIFO).
    SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.SpiderPriorityQueue'

    REDIS_HOST = '10.8.153.73'   # IP of the master's Redis server
    REDIS_PORT = 6379
    # Whether to keep the scheduler queue and dupefilter records when the spider closes:
    # True = keep, False = clear
    SCHEDULER_PERSIST = True
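The introduction lists a random user-agent and a request delay as anti-anti-crawling measures, but the post does not show how they were configured. A minimal sketch of what could be added, assuming a project package named homepro (the middleware class, its module path, and the user-agent list below are illustrative, not taken from the original project):

    # middlewares.py -- attach a randomly chosen User-Agent to every request (illustrative)
    import random

    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)',
    ]

    class RandomUserAgentMiddleware:
        def process_request(self, request, spider):
            request.headers['User-Agent'] = random.choice(USER_AGENTS)

    # settings.py -- request delay plus the middleware above
    DOWNLOAD_DELAY = 1                 # wait about 1 second between requests
    RANDOMIZE_DOWNLOAD_DELAY = True    # jitter the delay so it looks less mechanical
    DOWNLOADER_MIDDLEWARES = {
        'homepro.middlewares.RandomUserAgentMiddleware': 543,
    }
    # Optional: collect scraped items in the master's Redis as well,
    # using the pipeline shipped with scrapy-redis:
    # ITEM_PIPELINES = {'scrapy_redis.pipelines.RedisPipeline': 300}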

6. Copy the project code to each worker machine and start them; each worker's Redis client connects to the master's Redis server:

    redis-cli -h <master server IP>
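The post does not show the command that actually starts the spider on each worker. Assuming the spider name home from the code above, each worker would run the crawl and then block, waiting for a start URL to appear in the homespider:start_urls list:

    scrapy crawl home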

7. On the master, start redis-server first, then redis-cli, and push the start URL into the list the spider listens on:

    lpush homespider:start_urls <start URL>
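With the start URL from the introduction, that is:

    lpush homespider:start_urls http://zu.fang.com/cities.aspx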
Original post: https://www.cnblogs.com/xuechaojun/p/10164939.html