zoukankan      html  css  js  c++  java
  • Introducing shard translator

    Introducing shard translator

    by  on December 23, 2015

    GlusterFS-3.7.0 saw the release of sharding feature, among several others. The feature was tagged as “experimental” as it was still in the initial stages of development back then. Here is some introduction to the feature:

    Why shard translator?

    GlusterFS’ answer to very large files (those which can grow beyond a single brick) had never been clear. There is a stripe translator which allows you to do that, but that comes at a cost of flexibility – you can add servers only in multiple of stripe-count x replica-count, mixing striped and unstriped files is not possible in an “elegant” way. This also happens to be a big limiting factor for the big data/Hadoop use case where super large files are the norm (and where you want to split a file even if it could fit within a single server.) The proposed solution for this is to replace the current stripe translator with a new “shard” translator.

    What?

    Unlike stripe, Shard is not a cluster translator. It is placed on top of DHT. Initially all files will be created as normal files, even up to a certain configurable size. The first block (default 4MB) will be stored like a normal file under its parent directory. However further blocks will be stored in a file, named by the GFID and block index in a separate namespace (like /.shard/GFID1.1, /.shard/GFID1.2 … /.shard/GFID1.N). File IO happening to a particular offset will write to the appropriate “piece file”, creating it if necessary. The aggregated file size and block count will be stored in the xattr of the original file.

    Usage:

    Here I have a 2×2 distributed-replicated volume.

    # gluster volume info
    Volume Name: dis-rep
    Type: Distributed-Replicate
    Volume ID: 96001645-a020-467b-8153-2589e3a0dee3
    Status: Started
    Number of Bricks: 2 x 2 = 4
    Transport-type: tcp
    Bricks:
    Brick1: server1:/bricks/1
    Brick2: server2:/bricks/2
    Brick3: server3:/bricks/3
    Brick4: server4:/bricks/4
    Options Reconfigured:
    performance.readdir-ahead: on

    To enable sharding on it, this is what I do:

    # gluster volume set dis-rep features.shard on
    volume set: success

    Now, to configure the shard block size to 16MB, this is what I do:

    # gluster volume set dis-rep features.shard-block-size 16MB
    volume set: success

    How files are sharded:

    Now I write 84MB of data into a file named ‘testfile’.

    # dd if=/dev/urandom of=/mnt/glusterfs/testfile bs=1M count=84
    84+0 records in
    84+0 records out
    88080384 bytes (88 MB) copied, 13.2243 s, 6.7 MB/s

    Let’s check the backend to see how the file was sharded to pieces and how these pieces got distributed across the bricks:

    # ls /bricks/* -lh
    /bricks/1:
    total 0

    /bricks/2:
    total 0

    /bricks/3:
    total 17M
    -rw-r–r–. 2 root root 16M Dec 24 12:36 testfile

    /bricks/4:
    total 17M
    -rw-r–r–. 2 root root 16M Dec 24 12:36 testfile

    So the file hashed to the second replica set (brick3 and brick4 which form a replica pair) and 16M in size. Where did the remaining 68MB worth of data go? To find out, let’s check the contents of the hidden directory .shard on all bricks:

    # ls /bricks/*/.shard -lh
    /bricks/1/.shard:
    total 37M
    -rw-r–r–. 2 root root  16M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.1
    -rw-r–r–. 2 root root  16M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.3
    -rw-r–r–. 2 root root 4.0M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.5

    /bricks/2/.shard:
    total 37M
    -rw-r–r–. 2 root root  16M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.1
    -rw-r–r–. 2 root root  16M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.3
    -rw-r–r–. 2 root root 4.0M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.5

    /bricks/3/.shard:
    total 33M
    -rw-r–r–. 2 root root 16M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.2
    -rw-r–r–. 2 root root 16M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.4

    /bricks/4/.shard:
    total 33M
    -rw-r–r–. 2 root root 16M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.2
    -rw-r–r–. 2 root root 16M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.4

    So, the file was basically split into 6 pieces: 5 of them residing in the hidden directory “/.shard” distributed across replica sets based on disk space availability and the file name hash, and the first block residing in its native parent directory. Notice how blocks 1 through 4 are all of size 16M and the last block (block-5) is 4M in size.

    Now let’s do some math to see how ‘testfile’ was “sharded”:

    The total size of the write was 84MB. And the configured block size in this case is 16MB. So (84MB divided by 16MB) = 5 with remainder = 4MB

    So the file was basically broken into 6 pieces in all, with the last piece having 4MB of data and the rest of them 16MB in size.

    Now when we view the file from the mount point, it would appear as one single file:

    # ls -lh /mnt/glusterfs/
    total 85M
    -rw-r–r–. 1 root root 84M Dec 24 12:36 testfile

    Notice how the file is shown to be of size 84MB on the mount point. Similarly, when the file is read by an application, the different pieces or ‘shards’ are stitched together and appropriately presented to the application as if there was no chunking done at all.

    Advantages of sharding:

    The advantage of sharding a file over striping it across a finite set of bricks are:

    • Data blocks are distributed by DHT in a “normal way”.
    • Adding servers can happen in any number (even one at a time) and DHT’s rebalance will spread out the “piece files” evenly.
    • Sharding provides better utilization of disk space. Now it is no longer necessary to have at least one brick of size X in order to accommodate a file of size X, where X is really large. Consider this example: A distribute volume is made up of 3 bricks of size 10GB, 20GB, 30GB. With this configuration, it is impossible to store a file greater than 30GB in size on this volume. Sharding eliminates this limitation. A file of upto 60GB size can be stored on this volume with sharding.
    • Self-healing of a large file is now more distributed into smaller files across more servers leading to better heal performance and lesser CPU usage, which is particularly a pain point for large file workloads.
    • piece file naming scheme is immune to renames and hardlinks.
    • When geo-replicating a large file to a remote volume, only the shards that changed can be synced to the slave, considerably reducing the sync time.
    • When sharding is used in conjunction with tiering, only the shards that change would be promoted/demoted. This reduces the amount of data that needs to be migrated between hot and cold tier.
    • When sharding is used in conjunction with bit-rot detection feature of GlusterFS, the checksum is computed on smaller shards as opposed to one large file.

    Yes, sharding in its current form is not compatible with directory quota. This is something we are going to focus on, in the coming days – to make it compatible with other Gluster features (including directory quota and user/group quota which is a feature in design phase).

    Thanks,
    Krutika

  • 相关阅读:
    dirname,basename的用法与用途
    终极解决方案——sbt配置阿里镜像源,解决sbt下载慢,dump project structure from sbt耗时问题
    linux-manjaro下添加Yahei Hybrid Consola字体
    Idea无法调出搜狗等中文输入法
    Spring 源码学习系列
    BF算法
    Mybatis Mapper接口是如何找到实现类的-源码分析
    Lua脚本在redis分布式锁场景的运用
    GO语言一行代码实现反向代理
    SpringMVC源码分析-400异常处理流程及解决方法
  • 原文地址:https://www.cnblogs.com/vman/p/5074459.html
Copyright © 2011-2022 走看看