
    Scrapy crawler framework (Part 2)

    Saving the data to a JSON file

    Enable the pipeline in settings.py. The number is the priority (the smaller the value, the higher the priority).

    # Configure item pipelines
    # See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
    ITEM_PIPELINES = {
       'qsbkSpider.pipelines.QsbkspiderPipeline': 300,
    }
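When several pipelines are enabled, Scrapy sorts this dict by value and passes every item through them in ascending order. A minimal sketch of that ordering (the second pipeline name is hypothetical, added only to illustrate):

```python
# Hypothetical second pipeline added to show ordering; Scrapy sorts
# ITEM_PIPELINES by value and runs the smaller number first on every item.
ITEM_PIPELINES = {
    'qsbkSpider.pipelines.QsbkspiderPipeline': 300,
    'qsbkSpider.pipelines.CleanupPipeline': 100,  # hypothetical; runs first
}

order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(order)  # CleanupPipeline comes first, QsbkspiderPipeline second
```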
    

    qsbk.py

    # -*- coding: utf-8 -*-
    import scrapy
    
    
    class QsbkSpider(scrapy.Spider):
        name = 'qsbk'
        allowed_domains = ['www.yicommunity.com']
        start_urls = ['http://www.yicommunity.com/']
    
        def parse(self, response):
            print("=" * 80)
            contents = response.xpath('//div[@class="col1"]/div')
            print(contents)
            print("=" * 80)
            for content in contents:
                author = content.xpath("./div[@class='author']/text()").get()
                word = content.xpath("./div[@class='content']/text()").get()
                print(author, word)
                duanzi = {"author": author, "word": word}
                # yield turns parse() from a plain function into a generator,
                # so items are returned one at a time as it is iterated
                yield duanzi  # handed to the engine, which passes it to the pipeline
    
    
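The comment on `yield` above is worth unpacking with a standalone sketch (the function and field values here are illustrative, not from the spider): a function containing `yield` returns a generator, and the engine drives it one item at a time.

```python
# A parse()-like generator: nothing inside runs until it is iterated,
# and each iteration advances the function to its next yield.
def parse_like():
    for i in range(3):
        yield {"author": f"user{i}", "word": f"joke {i}"}

gen = parse_like()   # no body code has run yet; we only got a generator
items = list(gen)    # iterating produces the dicts one by one
print(len(items))    # 3
```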

    pipelines.py

    # -*- coding: utf-8 -*-
    
    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
    import json
    
    
    class QsbkspiderPipeline(object):
        def __init__(self):  # runs once when the pipeline object is created
            self.fp = open("duanzi.json", "w", encoding='utf-8')
    
        def process_item(self, item, spider):
            # ensure_ascii=False keeps the Chinese text readable in the file
            item_json = json.dumps(item, ensure_ascii=False)
            self.fp.write(item_json + '\n')
            return item
    
        def open_spider(self, spider):
            print("Spider started!")
    
        def close_spider(self, spider):
            self.fp.close()
            print("Spider finished!")
    
    
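The pipeline writes one JSON object per line. One detail to watch: `json.dumps` escapes non-ASCII characters by default, so without `ensure_ascii=False` the Chinese text would be stored as `\uXXXX` escapes rather than readable characters:

```python
import json

item = {"author": "张三", "word": "一个段子"}
print(json.dumps(item))                      # escaped: {"author": "\u5f20\u4e09", ...}
print(json.dumps(item, ensure_ascii=False))  # readable: {"author": "张三", ...}
```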

    Running the spider

    scrapy crawl qsbk
    


    A duanzi.json file is generated at the same time



    Optimization


  • Original post: https://www.cnblogs.com/senup/p/12319119.html