zoukankan      html  css  js  c++  java
  • Python Scrapy 爬虫入门

    1、目标地址 http://quotes.toscrape.com

    将页面的文章内容和作者爬下来,并保存到json文件里面。

    下面代码:

    用到的工具:scrapy ,xpath选择器,json,codecs编码

    爬虫代码:

    class ScrapeSpider(scrapy.Spider):
        name = 'toscrape'
        allowed_domains = ['toscrape.com']
    
        start_urls = [
            'http://quotes.toscrape.com'
        ]
    
        def parse(self, response):
            for quote in response.xpath('//div[@class="quote"]'):
                item = QuoteItem()
                item["text"] = quote.xpath('./span[@class="text"]/text()').get()
                item['author'] = quote.xpath('./span/small[@class="author"]/text()').get()
                yield item
    
            next_page = response.xpath('//nav/ul/li[@class="next"]/a/@href').get()
            if next_page and len(next_page) > 0:
                yield response.follow(next_page, self.parse)

    在items.py 中添加数据

    class QuoteItem(scrapy.Item):
        text = scrapy.Field()
        author = scrapy.Field()
        pass

    定义pipelines: 保存到quotes.json文件中

    import json
    import codecs

    class
    QuotePipeline(object): def __init__(self): self.file = codecs.open('quotes.json', 'wb', encoding='utf-8') pass def process_item(self, item, spider): lines = json.dumps(dict(item), ensure_ascii=False) + " " self.file.write(lines) return item def close_spider(self, spider): self.file.close() pass

    之后执行

      

    scrapy crawl toscrape

    爬下来的数据:

  • 相关阅读:
    函数1
    函数
    VC++中GDI和GDI+ 的坐标系统介绍
    CWnd与HWND的区别与转换
    VC++下的Unicode编程
    VS 和Visual Assist X快捷键(转)
    VC中CRect类的简单介绍
    ListControl的用法
    VC:GetWindowRect、GetClientRect、ScreenToClient与ClientToScreen
    VC中CDC与HDC的区别以及二者之间的转换
  • 原文地址:https://www.cnblogs.com/roger-jc/p/12011384.html
Copyright © 2011-2022 走看看