  • Scraping the novel 圣墟 (Sheng Xu) from the 6mao novel site with Python's Scrapy

    With some free time on my hands I wanted to read a novel on my computer, but after searching for a while I couldn't find a site that offered downloads, so I decided to scrape the chapters myself and save them locally.

    圣墟 (Sheng Xu), Chapter 1: The Flower of the Other Shore in the Desert - 辰东 (Chen Dong) - 6mao novel site  http://www.6mao.com/html/40/40184/12601161.html

    This is the page to scrape.

    Inspecting its structure shows the chapter title inside the div with id "content", the body text inside the div with id "neirong", and the "next chapter" link among the anchors of the div with class "s_page"; these are the three pieces the spider below extracts.

    Then create the Scrapy project:
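    Assuming the project is named sixmao, as the module paths in settings.py below suggest, the standard scaffolding commands are:

    scrapy startproject sixmao
    cd sixmao
    scrapy genspider sixmaospider 6mao.com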

    The spider itself, sixmaospider.py:

    # -*- coding: utf-8 -*-
    import scrapy
    from ..items import SixmaoItem


    class SixmaospiderSpider(scrapy.Spider):
        name = 'sixmaospider'
        #allowed_domains = ['6mao.com']
        start_urls = ['http://www.6mao.com/html/40/40184/12601161.html']  # Sheng Xu, chapter 1

        def parse(self, response):
            # the chapter title sits in <div id="content"><h1>
            novel_biaoti = response.xpath('//div[@id="content"]/h1/text()').extract()
            # the body text nodes sit in <div id="neirong">; on this page
            # every other text node is blank, hence the step of 2 below
            novel_neirong = response.xpath('//div[@id="neirong"]/text()').extract()

            for i in range(0, len(novel_neirong), 2):
                # build a fresh item per paragraph; reusing one mutable item
                # across yields would hand the pipeline stale references
                novelitem = SixmaoItem()
                novelitem['novel_biaoti'] = novel_biaoti[0]
                novelitem['novel_neirong'] = novel_neirong[i]
                yield novelitem

            # the div with class "s_page" holds prev/contents/next links,
            # so index 2 is the "next chapter" href
            nextPageURL = response.xpath('//div[@class="s_page"]/a/@href').extract()
            if len(nextPageURL) > 2:
                nexturl = response.urljoin(nextPageURL[2])
                print('next chapter:', nexturl)
                # request the next chapter and parse it with this same callback
                yield scrapy.Request(nexturl, self.parse, dont_filter=False)
            else:
                print('no next chapter, stopping')
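    As a side note, newer Scrapy releases offer get()/getall() in place of extract(), which also makes it easy to join a whole chapter into one item instead of yielding one item per paragraph. A minimal sketch of that variant, using the same XPaths as above:

    def parse(self, response):
        title = response.xpath('//div[@id="content"]/h1/text()').get()
        paragraphs = response.xpath('//div[@id="neirong"]/text()').getall()
        item = SixmaoItem()
        item['novel_biaoti'] = title
        # drop the blank text nodes and join the rest with newlines
        item['novel_neirong'] = '\n'.join(p.strip() for p in paragraphs if p.strip())
        yield item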

    pipelinesio.py saves the scraped content to a local file:

    import os


    class SixmaoPipeline(object):
        def process_item(self, item, spider):
            # make sure the output directory exists before appending
            os.makedirs('./data', exist_ok=True)
            # the with-block closes the file automatically
            with open('./data/圣墟.txt', 'a', encoding='utf-8') as fp:
                fp.write(item['novel_neirong'])
            return item
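    Opening and closing the file for every single item works but is wasteful. Scrapy pipelines also provide open_spider/close_spider hooks, so a variant could hold the file open for the whole crawl; a sketch:

    import os


    class SixmaoPipeline(object):
        def open_spider(self, spider):
            # create the output directory once and keep the file handle open
            os.makedirs('./data', exist_ok=True)
            self.fp = open('./data/圣墟.txt', 'a', encoding='utf-8')

        def process_item(self, item, spider):
            self.fp.write(item['novel_neirong'])
            return item

        def close_spider(self, spider):
            self.fp.close()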

    items.py

    import scrapy


    class SixmaoItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        novel_biaoti = scrapy.Field()   # chapter title
        novel_neirong = scrapy.Field()  # chapter body text
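    Scrapy items behave like dicts, which is all the pipeline above relies on; for example:

    item = SixmaoItem()
    item['novel_biaoti'] = '第一章 沙漠中的彼岸花'
    item['novel_neirong'] = 'chapter text here'
    print(item['novel_biaoti'])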

    startsixmao.py; right-click this file and run it, and the crawl starts:

    from scrapy.cmdline import execute
    
    execute(['scrapy', 'crawl', 'sixmaospider'])
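    This is equivalent to running scrapy crawl sixmaospider from the project root; execute() simply lets an IDE launch the same command-line invocation.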

    settings.py

    LOG_LEVEL = 'INFO'      # log at INFO level
    LOG_FILE = 'novel.log'  # write the log to this file

    DOWNLOADER_MIDDLEWARES = {
        'sixmao.middlewares.SixmaoDownloaderMiddleware': 543,
        # disable the built-in user-agent middleware ...
        'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
        # ... and enable the rotating user-agent middleware defined below
        'sixmao.rotate_useragent.RotateUserAgentMiddleware': 400,
    }


    ITEM_PIPELINES = {
        #'sixmao.pipelines.SixmaoPipeline': 300,
        'sixmao.pipelinesio.SixmaoPipeline': 300,  # the file-writing pipeline above
    }

    SPIDER_MIDDLEWARES = {
        'sixmao.middlewares.SixmaoSpiderMiddleware': 543,
    }
    # nothing else in settings.py should need to change
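    Two optional settings, not part of the original configuration, can make a long chapter-by-chapter crawl gentler on the server:

    DOWNLOAD_DELAY = 0.5     # pause between requests so the site is less likely to ban us
    CONCURRENT_REQUESTS = 8  # keep concurrency modest for a single-site crawl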

    rotate_useragent.py rotates the User-Agent header so the crawler looks like different browsers and is less likely to be blocked by the server:

    import random
    # the built-in UserAgentMiddleware class that we extend
    from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware


    # RotateUserAgentMiddleware extends UserAgentMiddleware.
    # Many sites employ anti-crawler measures and refuse obvious bots, so
    # every outgoing request picks a random user-agent string from the
    # list below and masquerades as a regular browser.
    class RotateUserAgentMiddleware(UserAgentMiddleware):
        def __init__(self, user_agent=''):
            self.user_agent = user_agent

        def process_request(self, request, spider):
            # rotate the user-agent on every request
            ua = random.choice(self.user_agent_list)
            if ua:
                # print the chosen user-agent for debugging
                print(ua)
                request.headers.setdefault('User-Agent', ua)

        # the default user_agent_list covers Chrome, IE, Firefox, Mozilla,
        # Opera and Netscape; more strings are listed at
        # http://www.useragentstring.com/pages/useragentstring.php
        user_agent_list = [
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
            "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
            "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        ]
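    Note that request.headers.setdefault only fills in the header when none is present yet; disabling the built-in UserAgentMiddleware in settings.py ensures the random string chosen here is the one that actually goes out.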

    The final run result: the crawl walks from chapter to chapter and the text accumulates in ./data/圣墟.txt.

    And there you have it: a small, complete Scrapy project.

  • Original post: https://www.cnblogs.com/yuxuanlian/p/9852492.html