zoukankan      html  css  js  c++  java
  • 爬虫.微博数据的存储:分布式数据库及应用

    分布式爬虫系统

      

    简单的分布式爬虫

      分布式爬虫的作用:1.解决目标地址对IP访问频率的限制

               2.利用更高的宽带,提高下载速度

               3.大规模系统的分布式存储和备份

               4.数据的扩展能力

    将多进程爬虫部署到多台主机上

      将数据库地址配置到统一的服务器上

      将数据库设置仅允许特定IP来源的访问请求

      设置防护墙,允许端口远程连接

    分布式爬虫系统-爬虫

    分布式存储

       爬虫原数据存储特点

        1.文件小,大量KB级别的文件

        2.文件数量大

        3.增量方式一次性写入,极少需要修改

        4.顺序读取

        5.并发的文件读写

        6.可扩展

    Googls FS

        

    HDFS

      Distributed,Scalable,Portable,File System

      Written in Java

      Not fully POSIX-compliant

      Replication : 3 copies by default

      Designed for immutable files

      Files are cached and chunked ,chunk size 64MB

                

    Python hdfs module

      Installation : pip install hdfs

        Methods :      Desc

        read()        read a file

        write()        write a file

        delete()      Remove a file or directory from HDFS

        rename()      Move a file or folder

        download()      Download a file or folder from HDFS and save it locally

         list()        Return names of files contained in a remote folder

        makedirs()       Create a remote directory , recursively if necessary

        resolve()       Return absolute , normalized path , with special markers expanded

        upload()        Upload a file or directory to HDFS

        walk()       Depth-first walk of remote filesystem

    存储到HDFS

      from hdfs import *

      from hdfs.util import HdfsError  

      hdfs_client=InsecureClient ('[ host ] : [ port ]',user='user')

      with hdfs_client.write('/htmls/mfw/%s.html'%(filename)) as writer :

          writer.write(html_page)

      except HdfsError,Arguments :

          print Arguments

    HBASE

      on top of HDFS

      Column-oriented database

      Can store huge size raw data

      KEY-VALUE

    HDFS   和    HBASE

        

    HBASE

      HBase is a column-oriented database and the tables in it are sorted by row. The table schema defines only column families,which are the key value pairs.

      A table have multiple column families and each column family can have any number of column. Subsequent column values are stored contiguously on the

      disk . Each cell value of the table has a timestamp. In short, in an HBase :

        1.Table is a collection of rows

        2.Row is a collection of column families

        3.Column family is a collection of columns 

        4.Column is a collection of key value pairs

          

        

    分布式爬虫系统—存储

      

    分布式爬虫—数据库

      

    MongoDB

      

    MongoDB

      Schema less - MongoDB is a document database in which one collection holds different documents. Number of fields,content and size of the document can differ from one 

      document to another.

      Structure of a single object is clear.

      No complex joins .

      Deep query-ability. MongoDB supports dynamic queries  on documents using a document-based query language that's nearly as powerful as SQL.

      Ease of scale-out-MongoDB is easy to scale .

      Conversion/ mapping of application objects to database objects not needed

    Installation

      download

      https://www.mongodb.com/download-center?jmp=nav#community

      https://fastdl.mongodb.org/linux/mongodb-linux-x86_64-amazon-3.4.2.tgz

      setup

      mkdir mongodb

      tar xzvf mongodb-liunx-x86_64-amazon-3.4.2.tgz-C mongodb

      client

      mongo

    Mongo DB

      db.collection.findOneAndUpdate(filter,update,options)

        Returns one document that satisfies the specified query criteria.

        Returns the first document according to natural order, means insert order

        Find and update are done atomically

      MongoClient methods :

      db.spider.mfw.find_one_and_uodate()

    数据库类型

       

    Redis Overview

      基于KEY  VALUE 模式的内存数据库

      支持复杂的对象模型(MemoryCached 仅支持少量类型)

      支持Replication,实现集群(MemoryCached 不支持分布式部署)

      所有操作都是原子性(MemoryCached 多数操作都不是原子的)

      可以序列化到磁盘(MemoryCached 不能序列化)

    Redis Environment Setup

      downlod

      $ wget http://download.redis.io/releases/redis-3.2.7.tar.gz

      $ tar xzf redis-3.2.7.tar.gz

      $ cd redis-3.2.7

      $ make

      Start server and cli

      $ nohup src/ redis-server& 

      $ src/redis-cli

      Test it

      redis >set foo bar

      OK

      redis >get foo

      "bar"

    python Redis

      Installation

      $ sudo pip install redis

      Sample Code

      >>>import redis

      >>>r=redis.StricRedis(host='localhost',port=6379,db=0)

      >>>

      >>>r.set(‘foo’,‘bar’)

      True

      >>>r.get(‘foo’)

      ‘bar’

        

    Mongo的优化

      url作为_id,默认会被创建索引,创建索引是需要额外开销的

      index尽量简单,url长一些

      dequeueUrl find_one()并没有利用index,会全库扫描,但是仍然会很快,因为扫描到第一个后就停止了,但是当下载完后的数量特别大的时候,扫描依然是很费时的,考虑一下能不能进一步优化

      插入的操作很频繁,每一个网页对应着几百次插入,到了depth=3的时候,基数网页是百万级,插入检查将会是亿级,考虑使用更高效的方式来检查

    Mongo with Redis

        

    status:create index OR in different collections

    Code Snippet

      

  • 相关阅读:
    【BZOJ1294】围豆豆(SCOI2009)-状压+BFS
    【BZOJ3244】树的计数(NOI2013)-概率期望+数学证明
    【BZOJ4826】影魔(AHOI&HNOI2017)-线段树+离线
    【BZOJ3832】Rally(POI2014)-拓扑排序+最长路+堆
    【BZOJ4556】字符串(TJOI&HEOI2016)-后缀数组+二分+RMQ+主席树
    【BZOJ1029】建筑抢修(JSOI2007)-贪心+堆
    【BZOJ1057】棋盘制作(ZJOI2007)-DP+悬线法
    【BZOJ1025】游戏(SCOI2009)-数论+背包DP
    java日期操作
    list类型for遍历
  • 原文地址:https://www.cnblogs.com/jacky912/p/10531315.html
Copyright © 2011-2022 走看看