  • Python from Beginner to Giving Up, Self-Study Notes 1: a simple Scrapy framework example

    I've been doing a lot of scraping lately, and much of the code I see online uses the Scrapy framework. Below is a simple Scrapy scraping example (environment: Python 3.8 + PyCharm):

    (1) Right-click the project directory -> Open in Terminal, then run the command below to create and initialize a Scrapy project:

    scrapy startproject qsbk 
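
    For reference, the generated project layout looks roughly like this (the file list varies slightly by Scrapy version):

    qsbk/
        scrapy.cfg            # deploy configuration
        qsbk/
            __init__.py
            items.py          # item definitions (step 4)
            middlewares.py
            pipelines.py      # item pipelines (step 6)
            settings.py       # project settings (step 3)
            spiders/
                __init__.py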

    (2) Create a spider named qsbk_spider. Note that genspider expects a bare domain rather than a full URL; the domain becomes the spider's allowed_domains, which limits what it is allowed to crawl:

    scrapy genspider qsbk_spider www.lovehhy.net
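
    This creates qsbk/spiders/qsbk_spider.py with roughly the following skeleton (the exact template varies a bit across Scrapy versions); the next steps fill it in:

    import scrapy


    class QsbkSpiderSpider(scrapy.Spider):
        name = 'qsbk_spider'
        allowed_domains = ['www.lovehhy.net']
        start_urls = ['http://www.lovehhy.net/']

        def parse(self, response):
            pass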

    (3) Configure the settings file:

    BOT_NAME = 'qsbk'
    
    SPIDER_MODULES = ['qsbk.spiders']
    NEWSPIDER_MODULE = 'qsbk.spiders'
    
    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False
    
    # Override the default request headers:
    DEFAULT_REQUEST_HEADERS = {
      'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
      'Accept-Language': 'en',
      'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
    }

    # Uncomment this block whenever the project needs pipeline processing:
    ITEM_PIPELINES = {
       'qsbk.pipelines.QsbkPipeline': 300,
    }
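
    For what it's worth, Scrapy also has a top-level USER_AGENT setting; putting the UA string there is a common alternative to the DEFAULT_REQUEST_HEADERS entry above, and either approach works for this example:

    # settings.py -- alternative to the 'User-Agent' header entry above
    USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'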

    (4) Configure items. Items here are used much like JavaBeans in JavaWeb: an item is just a class in which you define a field for each piece of data you want to scrape:

    import scrapy
    
    class QsbkItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        title = scrapy.Field()
        time = scrapy.Field()
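
    Items behave like dicts. As a quick sketch of how the spider will use this class (the values below are made up for illustration):

    from qsbk.items import QsbkItem

    item = QsbkItem(title='some post title', time='2020-02-22')  # hypothetical values
    print(item['title'])  # fields are read and written like dict keys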

    (5) Write the spider code:

    The parse method's job is to produce items and pass them on to the pipelines, which store the data.

    import scrapy

    from qsbk.items import QsbkItem


    class QsbkSpiderSpider(scrapy.Spider):
        name = 'qsbk_spider'
        allowed_domains = ['www.lovehhy.net']
        start_urls = ['http://www.lovehhy.net/Joke/Detail/QSBK/1']
        baseUrl = "http://www.lovehhy.net"

        def parse(self, response):
            # grab the title and post time of every entry on the page
            node_title_list = response.xpath("//div[@class='post_recommend_new']/h3/a/text()").extract()
            node_time_list = response.xpath("//div[@class='post_recommend_new']/div[@class='post_recommend_time']/text()").extract()
            for i in range(len(node_title_list)):
                item = QsbkItem(title=node_title_list[i], time=node_time_list[i])
                yield item
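
    Note that the code above defines baseUrl but never uses it; presumably it was intended for following relative next-page links. A minimal pagination sketch under that assumption (the next-page XPath below is guessed, not taken from the original post):

    # continuation of parse() -- the '下一页' (next page) anchor text is an
    # assumption about the site's markup, not verified
    next_href = response.xpath("//a[text()='下一页']/@href").extract_first()
    if next_href:
        yield scrapy.Request(self.baseUrl + next_href, callback=self.parse)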

    (6) Write the pipelines code to store the data:

    Here the data is stored in a CSV file. Note that the class name must match the entry registered in ITEM_PIPELINES in step (3), i.e. 'qsbk.pipelines.QsbkPipeline'.

    import csv


    class QsbkPipeline(object):
        def __init__(self):
            # open the output file once, when the pipeline is created
            self.file = open("mmm.csv", 'w+', newline="", encoding='utf-8')
            self.writer = csv.writer(self.file)

        def open_spider(self, spider):
            print("Spider started...")

        def process_item(self, item, spider):
            # write one row per scraped item
            self.writer.writerow([item['title'], item['time']])
            return item

        def close_spider(self, spider):
            self.file.close()
            print("Spider finished...")

    Finally, let's look at the scraped results.

    As before, start the spider by entering the following at the command line:

    scrapy crawl qsbk_spider

    Scraped result: each post's title and time ends up as one row in mmm.csv.
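
    As an aside, the spider can also be started from a plain Python script instead of the CLI; a sketch, assuming a hypothetical run.py placed next to scrapy.cfg:

    # run.py -- hypothetical launcher script in the project root
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    process = CrawlerProcess(get_project_settings())
    process.crawl('qsbk_spider')   # spider name from step (2)
    process.start()                # blocks until the crawl finishes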
