  • Scrapy Framework: The Generic Spider XMLFeedSpider

    Step 01: Create the project

    scrapy startproject xmlfeedspider
    

    Step 02: Create the spider from the XMLFeedSpider template

    scrapy genspider -t xmlfeed jobbole jobbole.com
    

    Step 03: Edit items.py

    import scrapy
    
    class JobboleItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        # Article title
        title = scrapy.Field()
        # Publication date
        public_date = scrapy.Field()
        # Article link
        link = scrapy.Field()
    

    Step 04: Configure the spider file jobbole.py

    # -*- coding: utf-8 -*-
    from scrapy.spiders import XMLFeedSpider
    # Import the item class defined in items.py
    from xmlfeedspider.items import JobboleItem
    
    class JobboleSpider(XMLFeedSpider):
        name = 'jobbole'
        allowed_domains = ['jobbole.com']
        start_urls = ['http://top.jobbole.com/feed/']
        iterator = 'iternodes'  # Iterator to use; defaults to 'iternodes' if not specified
        itertag = 'item'  # Call parse_node once for each <item> node in the feed
    
        def parse_node(self, response, selector):
            item = JobboleItem()
            item['title'] = selector.css('title::text').extract_first()
            item['public_date'] = selector.css('pubDate::text').extract_first()
            item['link'] = selector.css('link::text').extract_first()
            return item
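
    To see what parse_node extracts without running a crawl, the same per-node logic can be sketched with the standard library alone. This is only an illustration, not how Scrapy implements XMLFeedSpider: the sample feed, the parse_feed helper, and the example.com URLs are hypothetical and were made up for this sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, made up for illustration (not fetched from jobbole.com).
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <link>http://example.com/</link>
    <item>
      <title>First post</title>
      <link>http://example.com/1</link>
      <pubDate>Mon, 01 Jan 2018 00:00:00 +0000</pubDate>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/2</link>
      <pubDate>Tue, 02 Jan 2018 00:00:00 +0000</pubDate>
    </item>
  </channel>
</rss>
"""


def parse_feed(xml_text, itertag="item"):
    """Yield one dict per <itertag> node, mirroring the fields set in parse_node."""
    root = ET.fromstring(xml_text)
    # Visit every node named `itertag`, similar in spirit to itertag = 'item'.
    for node in root.iter(itertag):
        yield {
            "title": node.findtext("title"),
            "public_date": node.findtext("pubDate"),
            "link": node.findtext("link"),
        }


items = list(parse_feed(SAMPLE_FEED))
for item in items:
    print(item)
```

    Only the two item nodes produce output. The channel-level title and link are skipped because they are not inside an item node, which is the same effect itertag = 'item' has in the spider.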
    
  • Original article: https://www.cnblogs.com/hankleo/p/11872571.html