  • Python Web Scraping Study Notes (22): Scrapy Framework, a Worked Example

    Scraping a novel

    spider

    import scrapy
    from xiaoshuo.items import XiaoshuoItem
    
    
    class XiaoshuoSpiderSpider(scrapy.Spider):
        name = 'xiaoshuo_spider'
        allowed_domains = ['zy200.com']
        url = 'http://www.zy200.com/5/5943/'
        start_urls = [url + '11667352.html']
    
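        def parse(self, response):
            # Text nodes of the chapter body; extract() returns a list of strings.
            info = response.xpath("/html/body/div[@id='content']/text()").extract()
            # href of the third link in the footer bar, which the crawl follows next.
            href = response.xpath("//div[@class='zfootbar']/a[3]/@href").extract_first()
            xs_item = XiaoshuoItem()
            xs_item['content'] = info
            yield xs_item
    
            # Stop when the link is missing or points back to the index page.
            if href and href != 'index.html':
                new_url = self.url + href
                yield scrapy.Request(new_url, callback=self.parse)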
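
    The link-following step in parse can be tried without Scrapy. The sketch below is only an illustration: next_chapter_url is a hypothetical helper that is not part of the original spider, and the href "11667353.html" is a made-up value used for the demonstration. It resolves a relative href against the current page with the standard library's urljoin and returns None when the crawl should stop.

```python
from urllib.parse import urljoin

# Same base URL as the spider's `url` attribute.
BASE_URL = "http://www.zy200.com/5/5943/"


def next_chapter_url(current_url, href):
    """Return the absolute URL to crawl next, or None when crawling should stop.

    Hypothetical helper for illustration; the original spider inlines this logic.
    """
    # Stop when the page has no such link or the link leads back to the index.
    if not href or href == "index.html":
        return None
    # Resolve the relative href against the page it was found on.
    return urljoin(current_url, href)


first_page = BASE_URL + "11667352.html"
# "11667353.html" is an assumed href, not a value taken from the site.
print(next_chapter_url(first_page, "11667353.html"))
print(next_chapter_url(first_page, "index.html"))
```

    Inside a Scrapy callback, response.urljoin(href) performs the same resolution against the URL of the current response.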
    

    items

    import scrapy
    
    
    class XiaoshuoItem(scrapy.Item):
        # define the fields for your item here like:
        content = scrapy.Field()
        href = scrapy.Field()
    
    

    pipeline

    class XiaoshuoPipeline(object):
        def __init__(self):
            # Output file that collects the chapters in crawl order.
            self.filename = open("dp1.txt", "w", encoding="utf-8")
    
        def process_item(self, item, spider):
            # item["content"] is a list of text fragments; join them into one string.
            content = "".join(item["content"]) + '\n'
            self.filename.write(content)
            self.filename.flush()
            return item
    
        def close_spider(self, spider):
            self.filename.close()
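
    Scrapy only calls a pipeline that is registered in the project's settings. The fragment below is a sketch of that registration. It assumes the project package is named xiaoshuo (as the spider's import suggests) and that the class lives in the default pipelines.py module; neither is stated in the original notes.

```python
# settings.py (fragment)
# Assumed module path: <project package>.pipelines.<class name>.
# The integer sets the run order among pipelines; lower values run first.
ITEM_PIPELINES = {
    "xiaoshuo.pipelines.XiaoshuoPipeline": 300,
}
```

    With that in place, running scrapy crawl xiaoshuo_spider from the project directory should write the chapters to dp1.txt.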
    
  • Original article: https://www.cnblogs.com/thresh/p/13349394.html