  • Scrapy Code in Practice

    1. Spider code

      # -*- coding: utf-8 -*-
      import scrapy
      from yszd.items import YszdItem


      class YszdSpiderSpider(scrapy.Spider):
          # Spider name; a required attribute used to launch the spider
          name = 'yszd_spider'
          # Allowed domains: the spider only crawls URLs under these domains (optional)
          allowed_domains = ['itcast.cn']
          # Start URL list; the spider's first batch of requests comes from here
          start_urls = ['http://www.itcast.cn/channel/teacher.shtml']

          def parse(self, response):
              # // selects matching descendants at any depth below the current node;
              # XPath expressions usually begin with it
              # div[@class='li_txt'] matches div elements whose class attribute is 'li_txt'
              node_list = response.xpath("//div[@class='li_txt']")
              # Collect every item
              # items = []
              for node in node_list:
                  # Create an item object to hold the extracted fields
                  item = YszdItem()
                  # extract() converts the selector results into a list of Unicode strings
                  name = node.xpath("./h3/text()").extract()
                  title = node.xpath("./h4/text()").extract()
                  info = node.xpath("./p/text()").extract()

                  item['name'] = name[0]
                  item['title'] = title[0]
                  item['info'] = info[0]

                  yield item
                  # items.append(item)
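
      Note that name[0] raises IndexError if a node is missing the expected child element. A minimal, more defensive sketch of the same loop, using extract_first() (aliased as get() in newer Scrapy releases) with a default value:

      for node in node_list:
          item = YszdItem()
          # extract_first() returns the first match, or the given default,
          # instead of raising IndexError on an empty result
          item['name'] = node.xpath("./h3/text()").extract_first(default='')
          item['title'] = node.xpath("./h4/text()").extract_first(default='')
          item['info'] = node.xpath("./p/text()").extract_first(default='')
          yield item
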
    2. Item code (defines the fields to scrape)
      # -*- coding: utf-8 -*-

      # Define here the models for your scraped items
      #
      # See documentation in:
      # https://doc.scrapy.org/en/latest/topics/items.html

      import scrapy


      class YszdItem(scrapy.Item):
          name = scrapy.Field()   # teacher name (from <h3>)
          title = scrapy.Field()  # job title (from <h4>)
          info = scrapy.Field()   # profile text (from <p>)
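
      A Scrapy Item behaves like a dict, which is what the dict(item) call in the pipeline below relies on. A quick illustrative snippet (the field values are placeholders):

      from yszd.items import YszdItem

      item = YszdItem(name='...', title='...', info='...')
      print(dict(item))  # {'name': '...', 'title': '...', 'info': '...'}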

    3. Pipeline code

      # -*- coding: utf-8 -*-

      # Define your item pipelines here
      #
      # Don't forget to add your pipeline to the ITEM_PIPELINES setting
      # See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

      import json


      class YszdPipeline(object):
          def __init__(self):
              # utf-8 keeps the output independent of the platform's default encoding
              self.f = open("yszd.json", "w", encoding="utf-8")

          def process_item(self, item, spider):
              # ensure_ascii defaults to True, which would escape non-ASCII characters
              # as \uXXXX sequences; disable it to keep the Chinese text readable
              text = json.dumps(dict(item), ensure_ascii=False) + "\n"
              self.f.write(text)
              return item

          def close_spider(self, spider):
              self.f.close()
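
      Opening the file in __init__ works, but pipelines also get an open_spider hook that mirrors close_spider; a minimal sketch of that variant:

      import json


      class YszdPipeline(object):
          def open_spider(self, spider):
              # Called once when the spider starts; pairs with close_spider
              self.f = open("yszd.json", "w", encoding="utf-8")

          def process_item(self, item, spider):
              self.f.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
              return item

          def close_spider(self, spider):
              self.f.close()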

    4. Settings code (enable the pipeline; 300 is the priority, and lower numbers run first)
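
      A minimal sketch of the relevant settings.py entry (assuming the project module is named yszd, matching the imports above):

      # settings.py
      # Enable the JSON pipeline; the value is the priority (conventionally 0-1000),
      # and pipelines with lower numbers run first
      ITEM_PIPELINES = {
          'yszd.pipelines.YszdPipeline': 300,
      }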

    5. Run the spider

      Run the command: scrapy crawl yszd_spider

      Note: yszd_spider is the spider name you defined; it must match the name attribute set in the spider class in section 1!
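
      As a side note, for simple dumps Scrapy's built-in feed exports can write items out without a custom pipeline (the output filename here is illustrative):

      scrapy crawl yszd_spider -o teachers.json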

    6. Execution results
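
      A successful run should leave a yszd.json file with one JSON object per teacher, using the keys defined in YszdItem (the values below are placeholders):

      {"name": "...", "title": "...", "info": "..."}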
