  • Crawling two-level pages

    1. Create a new sun0769 project with scrapy

    scrapy startproject sun0769
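
    This generates the standard Scrapy project skeleton, roughly as sketched below (a middlewares.py may also appear, depending on the Scrapy version):

     sun0769/
         scrapy.cfg            # deploy configuration
         sun0769/
             __init__.py
             items.py          # item definitions (step 2)
             pipelines.py      # item pipelines (step 5)
             settings.py       # project settings (step 6)
             spiders/          # spider code lives here (step 4)
                 __init__.py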

    2. Define the fields to crawl in items.py

     import scrapy


     class Sun0769Item(scrapy.Item):
         # define the fields for your item here like:
         # name = scrapy.Field()
         problem_type = scrapy.Field()
         title = scrapy.Field()
         number = scrapy.Field()
         content = scrapy.Field()
         Processing_status = scrapy.Field()
         url = scrapy.Field()

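    A scrapy.Item behaves like a dict, which is how the spider below fills it in; only declared fields may be assigned (a minimal sketch):

     item = Sun0769Item()
     item['title'] = 'example'   # fine: title is declared above
     item['foo'] = 'x'           # raises KeyError: foo is not a declared field
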
    3. Quickly generate a CrawlSpider template

    scrapy genspider -t crawl dongguan wz.sun0769.com

    Note: the spider name given here must not be the same as the project name.

    4. Open dongguan.py and write the code

     # -*- coding: utf-8 -*-
     # import the scrapy module
     import scrapy
     # import the link-extractor class, used to pull out links that match a rule
     from scrapy.linkextractors import LinkExtractor
     # import the CrawlSpider class and Rule
     from scrapy.spiders import CrawlSpider, Rule
     # import the item class defined in items.py
     from sun0769.items import Sun0769Item

     class DongguanSpider(CrawlSpider):
         name = 'dongguan'
         allowed_domains = ['wz.sun0769.com']
         start_urls = ['http://d.wz.sun0769.com/index.php/question/huiyin?page=30']
         # links to further list pages
         pagelink = LinkExtractor(allow=r"page=\d+")
         # links to the detail page of each post
         pagelink2 = LinkExtractor(allow=r"/question/\d+/\d+\.shtml")

         rules = (
             Rule(pagelink, follow=True),
             Rule(pagelink2, callback='parse_item', follow=True),
         )

         def parse_item(self, response):
             # print response.url
             item = Sun0769Item()
             # xpath() returns a list of selectors
             #item['problem_type'] = response.xpath('//a[@class="red14"]').extract()
             # the header <strong> text reads like "提问:<title>   编号:<number>",
             # so the first space-separated token carries the title and the
             # last one carries the number
             header = response.xpath('//div[@class="pagecenter p3"]//strong[@class="tgray14"]/text()').extract()[0]
             item['title'] = header.split(" ")[0].split(":")[-1]
             item['number'] = header.split(" ")[-1].split(":")[-1]
             #item['content'] = response.xpath().extract()
             #item['Processing_status'] = response.xpath('//div/span[@class="qgrn"]/text()').extract()[0]
             # hand the item on to the pipeline
             yield item

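    The XPath expressions above can be checked interactively before running the crawl, using Scrapy's shell against any detail page matched by pagelink2 (the URL below is a placeholder):

     scrapy shell "http://wz.sun0769.com/question/..."
     >>> response.xpath('//div[@class="pagecenter p3"]//strong[@class="tgray14"]/text()').extract()
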

    5. Write the code in pipelines.py

     # -*- coding: utf-8 -*-

     # Define your item pipelines here
     #
     # Don't forget to add your pipeline to the ITEM_PIPELINES setting
     # See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

     import json

     class TencentPipeline(object):
         def open_spider(self, spider):
             # open the output file once, when the spider starts
             self.filename = open("dongguan.json", "w")

         def process_item(self, item, spider):
             # serialize each item as one JSON line
             text = json.dumps(dict(item), ensure_ascii=False) + "\n"
             self.filename.write(text.encode("utf-8"))
             return item

         def close_spider(self, spider):
             self.filename.close()

    6. Configure the relevant options in settings.py
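
    The original leaves this step empty; a minimal sketch of what sun0769/settings.py needs, assuming the pipeline class keeps the TencentPipeline name from step 5 (the priority 300 is an arbitrary value between 0 and 1000):

     # register the pipeline so Scrapy actually calls it; lower numbers run earlier
     ITEM_PIPELINES = {
         'sun0769.pipelines.TencentPipeline': 300,
     }
     # the site's robots.txt may block the crawl; disable the check if needed
     ROBOTSTXT_OBEY = False

    With that in place, run the crawl from the project root:

    scrapy crawl dongguan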


    Problems:

    1. How to merge content from different pages into a single item (see the sketch after this list)

    2. Content matching is still somewhat difficult (XPath, re)
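
    For problem 1, the usual Scrapy approach is to pass the half-filled item along in request.meta and finish it in the next callback. A minimal sketch, assuming a hypothetical detail_url extracted on the first page and a placeholder XPath on the second:

     def parse_first_page(self, response):
         item = Sun0769Item()
         # ... fill in the fields this page provides ...
         # carry the half-filled item over to the next page's callback
         yield scrapy.Request(detail_url, meta={'item': item},
                              callback=self.parse_second_page)

     def parse_second_page(self, response):
         # pick the item back up and finish filling it
         item = response.meta['item']
         item['content'] = response.xpath('//div[@class="content"]/text()').extract()
         yield item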

  • Original article: https://www.cnblogs.com/cuzz/p/7630314.html