  • Python web scraping study notes (22) - Scrapy framework: a worked example

    Scraping a novel

    spider

    import scrapy
    from xiaoshuo.items import XiaoshuoItem
    
    
    class XiaoshuoSpiderSpider(scrapy.Spider):
        name = 'xiaoshuo_spider'
        allowed_domains = ['zy200.com']
        url = 'http://www.zy200.com/5/5943/'
        start_urls = [url + '11667352.html']
    
        def parse(self, response):
            # all text nodes of the chapter body
            info = response.xpath("/html/body/div[@id='content']/text()").extract()
            # relative link to the next chapter (third <a> in the footer bar)
            href = response.xpath("//div[@class='zfootbar']/a[3]/@href").extract_first()
            xs_item = XiaoshuoItem()
            xs_item['content'] = info
            yield xs_item
    
            # follow the next-chapter link until it loops back to the index page;
            # also guard against a missing href (extract_first() may return None)
            if href and href != 'index.html':
                new_url = self.url + href
                yield scrapy.Request(new_url, callback=self.parse)
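
    The chapter-chaining step above builds the next URL by concatenating the relative
    filename onto the base url. The same resolution can be sketched with the standard
    library's `urljoin`; the chapter filename below is a hypothetical example, not one
    taken from the site:

```python
from urllib.parse import urljoin

base = 'http://www.zy200.com/5/5943/'

# a relative next-chapter href, as extract_first() would return it
# (the filename here is a made-up example)
href = '11667353.html'
print(urljoin(base, href))            # http://www.zy200.com/5/5943/11667353.html

# when the last chapter links back to the index page, the spider stops
print(urljoin(base, 'index.html'))    # http://www.zy200.com/5/5943/index.html
```

    `urljoin` also handles hrefs that are absolute paths or full URLs, which plain
    string concatenation would get wrong.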
    

    items

    import scrapy
    
    
    class XiaoshuoItem(scrapy.Item):
        # define the fields for your item here like:
        content = scrapy.Field()
        href = scrapy.Field()
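
    A `scrapy.Item` behaves like a dict that only accepts its declared fields;
    assigning an undeclared key raises `KeyError`. A minimal stdlib sketch of that
    behavior (not Scrapy's actual implementation) for the two fields above:

```python
class Item(dict):
    """Dict that rejects keys not declared in `fields` (mimics scrapy.Item)."""
    fields = ('content', 'href')

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f'{key} is not a declared field')
        super().__setitem__(key, value)


item = Item()
item['content'] = ['chapter text']   # declared field: accepted
try:
    item['title'] = 'oops'           # undeclared field: rejected
except KeyError as e:
    print(e)
```

    This is why a pipeline that reads a field the spider never declared or set
    (such as `title` here) fails at runtime.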
    
    

    pipeline

    class XiaoshuoPipeline(object):
        def __init__(self):
            self.filename = open("dp1.txt", "w", encoding="utf-8")
    
        def process_item(self, item, spider):
            # item["content"] is the list of text nodes extracted by the spider;
            # join it into one string and terminate the chapter with a newline
            content = ''.join(item["content"]) + '\n'
            self.filename.write(content)
            self.filename.flush()
            return item
    
        def close_spider(self, spider):
            self.filename.close()
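
    `process_item` receives `content` as the list of strings that `extract()`
    produced, so the pipeline must join it before writing. A minimal sketch of that
    step, with hypothetical extracted text standing in for real chapter content:

```python
# the spider yields the content field as a list of text nodes
item = {'content': ['第一段', '第二段', '第三段']}  # hypothetical extracted text

# join into one string and terminate the chapter with a newline,
# exactly as process_item does before writing to dp1.txt
content = ''.join(item['content']) + '\n'
print(repr(content))
```

    Note that the pipeline only runs if it is enabled in `settings.py`, e.g.
    `ITEM_PIPELINES = {'xiaoshuo.pipelines.XiaoshuoPipeline': 300}`.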
    
  • Original post: https://www.cnblogs.com/thresh/p/13349394.html