  • Scrapy Study Notes

    1. What is Scrapy

    Scrapy is a crawler framework built on Twisted. You only need to customize a few modules to get a working crawler.

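    2. Advantages of Scrapy

    Without Scrapy, writing a crawler by hand means sending requests with urllib or Requests and then building everything else yourself: a class for HTTP headers, multithreading, a proxy class, a de-duplication class, a data storage class, and an exception-handling mechanism. Scrapy ships with these pieces, or with hooks for them.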
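    To give a feel for that hand-written work, here is a minimal sketch of two of the pieces listed above, a de-duplication class and a header builder, using only the standard library. The names SeenFilter and build_request are made up for this illustration and are not part of Scrapy or of the original project.

```python
import urllib.request


class SeenFilter:
    """Hand-rolled URL de-duplication: remembers every URL it has been shown."""

    def __init__(self):
        self._seen = set()

    def is_new(self, url):
        """Return True the first time a URL is seen, False afterwards."""
        if url in self._seen:
            return False
        self._seen.add(url)
        return True


def build_request(url, user_agent="Mozilla/5.0"):
    """Attach HTTP headers by hand (no network access happens here)."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


seen = SeenFilter()
urls = ["https://example.com/a", "https://example.com/b", "https://example.com/a"]
fresh = [u for u in urls if seen.is_new(u)]
print(fresh)  # ['https://example.com/a', 'https://example.com/b']
```

    In Scrapy, the scheduler's duplicate filter and the downloader middlewares cover these two jobs.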

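    3. Scrapy architecture

    Scrapy Engine: the engine. It passes signals, messages, and data between the Scheduler, Pipeline, Spiders, and Downloader.

    Scheduler: the scheduler. Put simply, it is a queue. It accepts Requests sent by the Scrapy Engine and queues them; when the engine asks for the next one, the Scheduler hands a Request from the queue back to the engine.

    Downloader: the downloader. It accepts Requests from the Scrapy Engine, sends them and downloads the data, builds a Response, and returns it to the engine, which then passes the Response to the Spiders.

    Spiders: the spiders. This is where the crawling logic lives, such as regular expressions, BeautifulSoup, or XPath. If a Response leads to a further request, such as a "next page" link, the spider hands the URL to the Scrapy Engine, which passes it to the Scheduler for queuing.

    Pipeline: the item pipeline. This is where de-duplication and storage classes go; it handles post-processing such as filtering and storing the data.

    Downloader Middlewares: the downloader middlewares. These are custom extension components, and the place to wrap proxies and HTTP headers.

    Spider Middlewares: the spider middlewares. They can process the Responses going into the Spiders and the Requests coming out of them.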
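    The data flow between these components can be sketched as a toy simulation. This is not Scrapy code: every class below is a simplified stand-in written for this note, a request is just a string, and a dict plays the role of the website.

```python
from collections import deque


class Scheduler:
    """Queue of pending requests (here a request is just a page name)."""

    def __init__(self):
        self._queue = deque()

    def enqueue(self, request):
        self._queue.append(request)

    def next_request(self):
        return self._queue.popleft() if self._queue else None


class Downloader:
    """Turns a request into a response; a dict stands in for the network."""

    def __init__(self, fake_site):
        self._site = fake_site

    def fetch(self, request):
        return self._site[request]


class Spider:
    """Parses a response into scraped items and follow-up requests."""

    start_requests = ["page1"]

    def parse(self, response):
        for title in response["titles"]:
            yield {"item": title}
        if response["next"]:
            yield {"request": response["next"]}


class Pipeline:
    """Receives every scraped item; here it only collects them in a list."""

    def __init__(self):
        self.stored = []

    def process_item(self, item):
        self.stored.append(item)


def run_engine(spider, scheduler, downloader, pipeline):
    """The engine only routes messages between the other components."""
    for request in spider.start_requests:
        scheduler.enqueue(request)
    while (request := scheduler.next_request()) is not None:
        response = downloader.fetch(request)
        for result in spider.parse(response):
            if "request" in result:
                scheduler.enqueue(result["request"])  # e.g. the "next page"
            else:
                pipeline.process_item(result["item"])


site = {
    "page1": {"titles": ["A", "B"], "next": "page2"},
    "page2": {"titles": ["C"], "next": None},
}
pipeline = Pipeline()
run_engine(Spider(), Scheduler(), Downloader(site), pipeline)
print(pipeline.stored)  # ['A', 'B', 'C']
```

    The real engine is asynchronous and runs the middlewares at each hand-off, but the order of the hand-offs is the same.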

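    4. Scrapy example

    4.1 Crawling the Douban Movie Top 250

    There are plenty of tutorials online for setting up a Scrapy project, so search for one if you need it.

    First, a custom proxy middleware. It uses a hard-coded list of proxy IPs here; for a large volume of requests you would need a third-party proxy service. The source IP of the crawler is disguised as one of the following proxies.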

    import random


    class specified_proxy(object):
        def process_request(self, request, spider):
            # Pick a proxy IP at random
            PROXIES = ['http://183.207.95.27:80', 'http://111.6.100.99:80', 'http://122.72.99.103:80',
                       'http://106.46.132.2:80', 'http://112.16.4.99:81', 'http://123.58.166.113:9000',
                       'http://118.178.124.33:3128', 'http://116.62.11.138:3128', 'http://121.42.176.133:3128',
                       'http://111.13.2.131:80', 'http://111.13.7.117:80', 'http://121.248.112.20:3128',
                       'http://112.5.56.108:3128', 'http://42.51.26.79:3128', 'http://183.232.65.201:3128',
                       'http://118.190.14.150:3128', 'http://123.57.221.41:3128', 'http://183.232.65.203:3128',
                       'http://166.111.77.32:3128', 'http://42.202.130.246:3128', 'http://122.228.25.97:8101',
                       'http://61.136.163.245:3128', 'http://121.40.23.227:3128', 'http://123.96.6.216:808',
                       'http://59.61.72.202:8080', 'http://114.141.166.242:80', 'http://61.136.163.246:3128',
                       'http://60.31.239.166:3128', 'http://114.55.31.115:3128', 'http://202.85.213.220:3128']
            # random.choice returns a single URL string, which is what meta['proxy'] expects
            # (random.sample(PROXIES, 1) would return a one-element list instead)
            request.meta['proxy'] = random.choice(PROXIES)
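    A possible variation, shown only as a sketch: keep the proxy list in the project settings and let Scrapy hand it to the middleware through from_crawler. The class name SettingsProxyMiddleware and the setting name PROXY_LIST are assumptions made for this example; neither appears in the original project or among Scrapy's built-in settings.

```python
import random


class SettingsProxyMiddleware:
    """Picks a random proxy per request from a list given at construction."""

    def __init__(self, proxies):
        self.proxies = list(proxies)

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls from_crawler, when it is defined, to build the middleware.
        # PROXY_LIST is a hypothetical setting name chosen for this sketch.
        return cls(crawler.settings.getlist("PROXY_LIST"))

    def process_request(self, request, spider):
        if self.proxies:
            # meta['proxy'] must be one URL string, hence random.choice
            request.meta["proxy"] = random.choice(self.proxies)
        return None  # None tells Scrapy to continue handling the request
```

    With this layout the list can be changed per environment without touching middlewares.py.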

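    Next, a custom User-Agent, so that the target server sees requests that look like they come from a browser on a real operating system rather than from a script.

    import random


    class specified_useragent(object):
        def process_request(self, request, spider):
            # Pick a User-Agent at random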
            USER_AGENT_LIST = [
                "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
                "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
                "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
                "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
                "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
                "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
                "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
                "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
                "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
                "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
                "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
                "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
                "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
                "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
                "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
            ]
            agent = random.choice(USER_AGENT_LIST)
            # The HTTP header name is 'User-Agent'
            request.headers['User-Agent'] = agent

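    Once the custom middlewares are written, register them in settings.py.

    # Lower numbers run first: process_request is called in increasing order
    DOWNLOADER_MIDDLEWARES = {
        'ScrapyTest.middlewares.specified_proxy': 543,
        'ScrapyTest.middlewares.specified_useragent': 544,
    }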
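    The effect of those numbers can be illustrated with plain Python. Scrapy calls process_request in increasing order of the value and process_response in decreasing order; the sorting below only mimics that rule and is not how Scrapy builds its middleware chain internally.

```python
DOWNLOADER_MIDDLEWARES = {
    "ScrapyTest.middlewares.specified_useragent": 544,
    "ScrapyTest.middlewares.specified_proxy": 543,
}

# Requests travel from the engine to the downloader: smallest number first.
request_order = sorted(DOWNLOADER_MIDDLEWARES, key=DOWNLOADER_MIDDLEWARES.get)

# Responses travel back the other way: largest number first.
response_order = list(reversed(request_order))

print(request_order[0])   # ScrapyTest.middlewares.specified_proxy
print(response_order[0])  # ScrapyTest.middlewares.specified_useragent
```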

    Define the data fields in items.py

    import scrapy


    class ScrapytestItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        # Rank of the movie in the list
        serial_number = scrapy.Field()
        # Movie title
        movie_name = scrapy.Field()
        # Movie introduction
        introduce = scrapy.Field()
        # Rating
        star = scrapy.Field()
        # Number of reviews
        evaluate = scrapy.Field()
        # Movie description
        describe = scrapy.Field()

    Configure data storage in pipelines.py, connecting to MongoDB

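    import pymongo

    # Assumes the values live in the settings module of the ScrapyTest project
    from ScrapyTest.settings import monodb_host, monodb_port, monodb_db_name, monodb_tb_name


    class ScrapytestPipeline(object):
        def __init__(self):
            host = monodb_host
            port = monodb_port
            dbname = monodb_db_name
            sheetname = monodb_tb_name
            client = pymongo.MongoClient(host=host, port=port)
            mydb = client[dbname]
            self.post = mydb[sheetname]

        def process_item(self, item, spider):
            data = dict(item)
            # insert_one replaces Collection.insert, which was removed in PyMongo 4
            self.post.insert_one(data)
            return item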
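    As a hedged alternative, the pipeline could open the connection when the spider starts and close it when the spider finishes, reading its configuration through from_crawler. The class name and the uppercase setting names MONGO_HOST, MONGO_PORT, MONGO_DB, and MONGO_COLLECTION are hypothetical and chosen for this sketch; uppercase is used because Scrapy only loads uppercase names from the settings module into crawler.settings. pymongo is imported inside open_spider purely so the sketch can be loaded without the driver installed.

```python
class MongoPipelineSketch:
    """Stores each item in MongoDB; the connection lives as long as the spider."""

    def __init__(self, host, port, db_name, collection_name):
        self.host = host
        self.port = port
        self.db_name = db_name
        self.collection_name = collection_name
        self.client = None
        self.collection = None

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        # Hypothetical setting names; the defaults mirror the values in this post.
        return cls(
            host=settings.get("MONGO_HOST", "127.0.0.1"),
            port=settings.getint("MONGO_PORT", 27017),
            db_name=settings.get("MONGO_DB", "scrapy_test"),
            collection_name=settings.get("MONGO_COLLECTION", "douban_movie"),
        )

    def open_spider(self, spider):
        import pymongo  # third-party driver, needed only once a crawl starts

        self.client = pymongo.MongoClient(host=self.host, port=self.port)
        self.collection = self.client[self.db_name][self.collection_name]

    def close_spider(self, spider):
        if self.client is not None:
            self.client.close()

    def process_item(self, item, spider):
        self.collection.insert_one(dict(item))
        return item  # hand the item on to any later pipeline
```

    Like any pipeline, it would still need an entry in ITEM_PIPELINES in settings.py to be used.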

    Database settings in settings.py

    monodb_host = "127.0.0.1"
    monodb_port = 27017
    monodb_db_name = "scrapy_test"
    monodb_tb_name = "douban_movie"

    Result after running main

    The inserted data can be seen in the MongoDB database

    use scrapy_test;
    show collections;
    db.douban_movie.find().pretty()

    4.2 Getting the source code

    https://github.com/cjy513203427/ScrapyTest

  • Original article: https://www.cnblogs.com/Java-Starter/p/10021133.html