  • Erlang distributed database Mnesia: implementation and applications

    Recommended reading first: Mnesia source code analysis (yufeng)
     
    - linear hash
    ETS, DETS, and Mnesia all use the linear hashing algorithm.
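As a rough illustration of the idea, here is a minimal linear hashing sketch in Python. It is not the ETS/DETS code: the class name, the chained buckets, and the 0.75 load threshold are all assumptions made for the example. What it shows is the defining property of linear hashing: the table grows one bucket at a time, and each growth step rehashes only the single bucket being split.

```python
class LinearHashTable:
    """Toy linear hash table (illustrative only, not the ETS/DETS implementation)."""

    def __init__(self, initial_buckets=4, max_load=0.75):
        self.n0 = initial_buckets   # bucket count at level 0
        self.level = 0              # number of completed doubling rounds
        self.split = 0              # next bucket to split in the current round
        self.max_load = max_load    # assumed threshold: average items per bucket
        self.buckets = [[] for _ in range(initial_buckets)]
        self.size = 0

    def _index(self, key):
        h = hash(key)
        i = h % (self.n0 << self.level)
        if i < self.split:
            # This bucket was already split in this round: use the finer hash.
            i = h % (self.n0 << (self.level + 1))
        return i

    def get(self, key, default=None):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return default

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for pos, (k, _) in enumerate(bucket):
            if k == key:
                bucket[pos] = (key, value)
                return
        bucket.append((key, value))
        self.size += 1
        if self.size > self.max_load * len(self.buckets):
            self._split_one()

    def _split_one(self):
        # Grow by exactly one bucket and redistribute only the split bucket.
        old = self.buckets[self.split]
        self.buckets[self.split] = []
        self.buckets.append([])
        self.split += 1
        if self.split == self.n0 << self.level:
            # Every bucket of this round has been split: the table has doubled.
            self.level += 1
            self.split = 0
        for k, v in old:
            self.buckets[self._index(k)].append((k, v))


table = LinearHashTable()
for i in range(1000):
    table.put(i, i * i)
```

Because each insert triggers at most one bucket split, the cost of growing is spread evenly over the inserts instead of arriving as one full-table rehash.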
     
     
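    The Redis dict implementation is similar to linear hashing: it rehashes incrementally so that every operation stays O(1). Besides rehashing one bucket on each operation, it also spends 1 ms out of every 100 ms on rehashing to speed the process up.
    Although the rehash itself is incremental, when the key space is very large and LRU eviction is in use, the malloc of the big buckets array alone can stall Redis for a while.
    A case I once ran into: a production Redis deployment ran in automatic master/standby failover mode, and for a while it kept failing over for no apparent reason. Investigation showed a key space of over 10 million keys and heavy eviction around each switch. The buckets array needed a malloc of a table twice the size, i.e. 10M * 24 * 2 = 480M of memory. Memory was permanently full and only kept in check by LRU replacement, so having to free a block that large made the Redis instance stop responding for several seconds, which triggered the failover. Judging from this case and from memory utilization, try to keep the keyspace from growing too large when using Redis.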
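The behaviour described above can be sketched in Python as follows. This is an illustration, not Redis's dict.c: the class and function names are hypothetical, and the 24 bytes per slot is simply the figure from the calculation above. The point is that moving entries is spread over many operations, but the new bucket array is still allocated in one piece when the resize starts.

```python
def new_table_bytes(keys, bytes_per_slot, growth_factor=2):
    """Size of the one-shot allocation for the grown table (the author's arithmetic)."""
    return keys * bytes_per_slot * growth_factor


class IncrementalDict:
    """Toy dict with incremental rehash: one bucket is migrated per operation."""

    def __init__(self, size=4):
        self.old = [[] for _ in range(size)]
        self.new = None        # allocated all at once when a resize starts
        self.rehash_idx = 0    # next bucket of `old` to migrate
        self.count = 0

    def _start_resize(self):
        # The single large allocation: this is the step that can stall.
        self.new = [[] for _ in range(len(self.old) * 2)]
        self.rehash_idx = 0

    def _step(self):
        """Migrate one bucket from the old table to the new one, if resizing."""
        if self.new is None:
            return
        for k, v in self.old[self.rehash_idx]:
            self.new[hash(k) % len(self.new)].append((k, v))
        self.old[self.rehash_idx] = []
        self.rehash_idx += 1
        if self.rehash_idx == len(self.old):
            self.old, self.new, self.rehash_idx = self.new, None, 0

    def put(self, key, value):
        self._step()
        if self.new is None and self.count >= len(self.old):
            self._start_resize()
        tables = [self.old] if self.new is None else [self.old, self.new]
        for t in tables:
            bucket = t[hash(key) % len(t)]
            for i, (k, _) in enumerate(bucket):
                if k == key:
                    bucket[i] = (key, value)
                    return
        # New keys always go to the new table while a resize is in progress.
        target = self.old if self.new is None else self.new
        target[hash(key) % len(target)].append((key, value))
        self.count += 1

    def get(self, key, default=None):
        self._step()
        for t in (self.old, self.new):
            if t is None:
                continue
            for k, v in t[hash(key) % len(t)]:
                if k == key:
                    return v
        return default


d = IncrementalDict()
for i in range(500):
    d.put(i, str(i))

print(new_table_bytes(10_000_000, 24) // 1_000_000, "MB")  # prints "480 MB"
```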
     
    - ETS
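    Erlang's built-in database takes on 70 million QPS
    ETS is simple in implementation: it is just an in-memory dictionary. It uses read-write locks, so it reaches very high TPS under read-only load. I once benchmarked a dictionary on my old T420 laptop at 4 million reads/writes per second on a single core. Judging from that number, ETS reads are about as fast as reads from a global in-memory dictionary, which is very efficient. Write performance is inevitably limited by the global lock, and it gets worse as concurrency increases. For write-heavy ETS tables, splitting the data across several tables is recommended.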
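The table-splitting advice can be sketched like this. It is a Python stand-in under stated assumptions, not ETS code: `ShardedTable`, the shard count of 8, and the CRC32-based routing are all hypothetical choices for illustration. In Erlang the same shape would be several ETS tables with the target chosen by something like `erlang:phash2(Key, N)`.

```python
import threading
import zlib


class ShardedTable:
    """Several small tables, each with its own lock, instead of one global lock."""

    def __init__(self, shards=8):
        self._shards = [({}, threading.Lock()) for _ in range(shards)]

    def _shard(self, key):
        # Deterministic routing: the same key always lands in the same shard.
        idx = zlib.crc32(repr(key).encode()) % len(self._shards)
        return self._shards[idx]

    def insert(self, key, value):
        table, lock = self._shard(key)
        with lock:
            table[key] = value

    def lookup(self, key, default=None):
        table, lock = self._shard(key)
        with lock:
            return table.get(key, default)

    def size(self):
        return sum(len(table) for table, _ in self._shards)


def writer(table, start, count):
    for i in range(start, start + count):
        table.insert(i, i * 2)


sharded = ShardedTable()
threads = [threading.Thread(target=writer, args=(sharded, n * 1000, 1000))
           for n in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Writers that touch different shards no longer wait on each other; only writers that hash to the same shard contend for a lock.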
     
     
    - DETS 
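    The on-disk storage form of ETS. A single table is limited to 2 GB. It supports a cache, but the cache defaults to 0, so by default every read and write goes to disk.
    As mentioned above, DETS stores data with linear hashing. Hashing is not very disk friendly and not friendly to the file block cache. The cache only works as a row-level index; there is no block-level index.
    Overall, DETS falls somewhat short of a complete storage engine and has little value on its own, so it is mostly used through Mnesia, the clustered layer built on top of it.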

    Since all operations performed by Dets are disk operations, it is important to realize that a single look-up operation involves a series of disk seek and read operations. For this reason, the Dets functions are much slower than the corresponding Ets functions, although Dets exports a similar interface.

    Dets organizes data as a linear hash list and the hash list grows gracefully as more data is inserted into the table. Space management on the file is performed by what is called a buddy system. The current implementation keeps the entire buddy system in RAM, which implies that if the table gets heavily fragmented, quite some memory can be used up. The only way to defragment a table is to close it and then open it again with the repair option set to force.


    - Mnesia 
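    A powerful distributed database implemented in pure Erlang on top of ETS/DETS. The size of a disc Mnesia table is limited by DETS, but fragmentation can be used; fragments are similar to a partitioned table.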
     
    Replacing DETS with LevelDB (1/4 the startup time, 1/2 the conflicts, 1/3 the memory footprint)
    Mnesia Backend Plugin Framework and a LevelDB-based Plugin: Roland Karlsson, Malcolm Matalka
     
    WhatsApp:
    - disc_copies tables
    - Partitioned islands and fragmented tables
    - All operations run async_dirty
    - Use key hashing to collapse all ops per key to a single process
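The last item, collapsing all operations on a key into a single process, can be sketched as follows. This is a Python approximation using threads and queues in place of Erlang processes and mailboxes. `KeyedWorkers` and its methods are hypothetical names, and nothing here reflects WhatsApp's actual code.

```python
import queue
import threading
import zlib


class KeyedWorkers:
    """Route every operation on a key to one fixed worker, so no locks are needed."""

    def __init__(self, workers=4):
        self._queues = [queue.Queue() for _ in range(workers)]
        # Each state dict is only ever touched by its own worker thread.
        self._state = [{} for _ in range(workers)]
        self._threads = [
            threading.Thread(target=self._run, args=(i,), daemon=True)
            for i in range(workers)
        ]
        for t in self._threads:
            t.start()

    def _run(self, i):
        q, state = self._queues[i], self._state[i]
        while True:
            op = q.get()
            if op is None:          # shutdown marker
                break
            key, fn = op
            state[key] = fn(state.get(key))

    def submit(self, key, fn):
        # Key hashing: all operations for one key go to the same queue, in order.
        i = zlib.crc32(repr(key).encode()) % len(self._queues)
        self._queues[i].put((key, fn))

    def close(self):
        """Drain all queues, stop the workers, and return the merged state."""
        for q in self._queues:
            q.put(None)
        for t in self._threads:
            t.join()
        merged = {}
        for state in self._state:
            merged.update(state)
        return merged


pool = KeyedWorkers()
for _ in range(1000):
    for name in ("a", "b", "c", "d", "e"):
        pool.submit(name, lambda v: (v or 0) + 1)
counters = pool.close()
```

Since each key has exactly one owner, operations on it are serialized by the owner's queue, which is what makes running without transactions (as with async_dirty above) workable.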
     
     

    First of all, mnesia has no 2 gigabyte limit. It is limited on a 32bit architecture, but hardly any are present anymore for real work. And on 64bit, you are not limited to 2 gigabyte. I have seen databases on the order of several hundred gigabytes. The only problem is the initial start-up time for those.

    Mnesia is built to handle:
     
    • Very low latency K/V lookup, not necessarily linearizable.
    • Proper transactions with linearizable changes (C in the CAP theorem). These are allowed to run at a much worse latency as they are expected to be relatively rare.
    • On-line schema change
    • Survival even if nodes fail in a cluster (where cluster is smallish, say 10-50 machines at most)

    The design is such that you avoid a separate process since data is in the Erlang system already. You have QLC for datalog-like queries. And you have the ability to store any Erlang term.

    Mnesia fares well if the above is what you need. Its limits are:

    • You can't get a machine with more than 2 terabytes of memory. And loading 2 teras from scratch is going to be slow.
    • Since it is a CP system and not an AP system, the loss of nodes requires manual intervention. You may not need transactions as well. You might also want to be able to seamlessly add more nodes to the system and so on. For this, Riak is a better choice.
    • It uses optimistic locking which gives trouble if many processes try to access the same row in a transaction.
  • Original article: https://www.cnblogs.com/lulu/p/3950527.html