
[Original] Big Data Basics: Hadoop (3) hdfs diskbalancer

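When the disks inside a single HDFS node are unevenly filled (for example, after a new disk is added), you have to run diskbalancer manually. The help for the plan command is as follows: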

# hdfs diskbalancer -help plan
usage: hdfs diskbalancer -plan <hostname> [options]
Creates a plan that describes how much data should be moved between disks.
 
 
    --bandwidth <arg>             Maximum disk bandwidth (MB/s) in integer
                                  to be consumed by diskBalancer. e.g. 10
                                  MB/s.
    --maxerror <arg>              Describes how many errors can be
                                  tolerated while copying between a pair
                                  of disks.
    --out <arg>                   Local path of file to write output to,
                                  if not specified defaults will be used.
    --plan <arg>                  Hostname, IP address or UUID of datanode
                                  for which a plan is created.
    --thresholdPercentage <arg>   Percentage of data skew that is
                                  tolerated before disk balancer starts
                                  working. For example, if total data on a
                                  2 disk node is 100 GB then disk balancer
                                  calculates the expected value on each
                                  disk, which is 50 GB. If the tolerance
                                  is 10% then data on a single disk needs
                                  to be more than 60 GB (50 GB + 10%
                                  tolerance value) for Disk balancer to
                                  balance the disks.
    --v                           Print out the summary of the plan on
                                  console

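The description of thresholdPercentage is ambiguous: it reads as if balancing were decided by the absolute amount of data on each disk. Let's check the code: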

org.apache.hadoop.hdfs.server.diskbalancer.datamodel.DiskBalancerVolumeSet

/**
 * Computes Volume Data Density. Adding a new volume changes
 * the volumeDataDensity for all volumes. So we throw away
 * our priority queue and recompute everything.
 *
 * we discard failed volumes from this computation.
 *
 * totalCapacity = totalCapacity of this volumeSet
 * totalUsed = totalDfsUsed for this volumeSet
 * idealUsed = totalUsed / totalCapacity
 * dfsUsedRatio = dfsUsedOnAVolume / Capacity On that Volume
 * volumeDataDensity = idealUsed - dfsUsedRatio
 */
public void computeVolumeDataDensity() {
  long totalCapacity = 0;
  long totalUsed = 0;
  sortedQueue.clear();
 
  // when we plan to re-distribute data we need to make
  // sure that we skip failed volumes.
  for (DiskBalancerVolume volume : volumes) {
    if (!volume.isFailed() && !volume.isSkip()) {
 
      if (volume.computeEffectiveCapacity() < 0) {
        skipMisConfiguredVolume(volume);
        continue;
      }
 
      totalCapacity += volume.computeEffectiveCapacity();
      totalUsed += volume.getUsed();
    }
  }
 
  if (totalCapacity != 0) {
    this.idealUsed = truncateDecimals(totalUsed /
        (double) totalCapacity);
  }
 
  for (DiskBalancerVolume volume : volumes) {
    if (!volume.isFailed() && !volume.isSkip()) {
      double dfsUsedRatio =
          truncateDecimals(volume.getUsed() /
              (double) volume.computeEffectiveCapacity());
 
      volume.setVolumeDataDensity(this.idealUsed - dfsUsedRatio);
      sortedQueue.add(volume);
    }
  }
}
 
 
/**
 * Computes whether we need to do any balancing on this volume Set at all.
 * It checks if any disks are out of threshold value
 *
 * @param thresholdPercentage - threshold - in percentage
 *
 * @return true if balancing is needed false otherwise.
 */
public boolean isBalancingNeeded(double thresholdPercentage) {
  double threshold = thresholdPercentage / 100.0d;
 
  if(volumes == null || volumes.size() <= 1) {
    // there is nothing we can do with a single volume.
    // so no planning needed.
    return false;
  }
 
  for (DiskBalancerVolume vol : volumes) {
    boolean notSkip = !vol.isFailed() && !vol.isTransient() && !vol.isSkip();
    Double absDensity =
        truncateDecimals(Math.abs(vol.getVolumeDataDensity()));
 
    if ((absDensity > threshold) && notSkip) {
      return true;
    }
  }
  return false;
}

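There are two main methods:

computeVolumeDataDensity: computes the data density of each disk, defined as the usage ratio across all disks (idealUsed) minus the usage ratio of the disk itself (dfsUsedRatio)
isBalancingNeeded: decides whether the volume set needs balancing, i.e. whether the absolute value of any disk's data density exceeds the configured threshold (thresholdPercentage / 100)

So balancing is actually driven by each disk's usage ratio, not by the absolute amount of space used.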
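To make the difference between ratio and absolute value concrete, here is a minimal standalone sketch of the same arithmetic. It is not the Hadoop implementation: the class name DensitySketch and its array-based methods are hypothetical, and it leaves out the truncateDecimals step and the failed/skipped/transient volume checks from the real code. Sizes are in GB.

```java
import java.util.Arrays;

// Hypothetical standalone sketch of the density arithmetic shown above.
// Assumption: truncateDecimals and the failed/skip/transient checks are omitted.
public class DensitySketch {

    // idealUsed = total used / total capacity over all volumes.
    static double idealUsed(long[] used, long[] capacity) {
        long totalUsed = Arrays.stream(used).sum();
        long totalCapacity = Arrays.stream(capacity).sum();
        return totalCapacity == 0 ? 0.0 : totalUsed / (double) totalCapacity;
    }

    // volumeDataDensity = idealUsed - dfsUsedRatio, one entry per volume.
    static double[] densities(long[] used, long[] capacity) {
        double ideal = idealUsed(used, capacity);
        double[] result = new double[used.length];
        for (int i = 0; i < used.length; i++) {
            result[i] = ideal - used[i] / (double) capacity[i];
        }
        return result;
    }

    // True if any volume's |density| exceeds thresholdPercentage / 100.
    static boolean isBalancingNeeded(long[] used, long[] capacity,
                                     double thresholdPercentage) {
        if (used.length <= 1) {
            return false; // nothing to balance with a single volume
        }
        double threshold = thresholdPercentage / 100.0d;
        for (double density : densities(used, capacity)) {
            if (Math.abs(density) > threshold) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // 1000 GB and 4000 GB disks, both 40% full: absolute usage differs
        // by 1200 GB, but every density is 0, so no balancing is needed.
        System.out.println(isBalancingNeeded(
            new long[] {400, 1600}, new long[] {1000, 4000}, 10));

        // Two 1000 GB disks, one 80% full and one newly added and empty:
        // densities are -0.4 and 0.4, which exceed the 0.1 threshold.
        System.out.println(isBalancingNeeded(
            new long[] {800, 0}, new long[] {1000, 1000}, 10));

        // Two 1000 GB disks holding 70 GB and 30 GB (100 GB in total):
        // densities are only -0.02 and 0.02, so no balancing is needed.
        System.out.println(isBalancingNeeded(
            new long[] {70, 30}, new long[] {1000, 1000}, 10));
    }
}
```

The third case is the one where the help text is easy to misread: one disk holds well over 60 GB out of 100 GB, yet on 1000 GB disks the usage ratios differ by only a few percentage points, so with a threshold of 10 this sketch reports that no balancing is needed.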


Original article: https://www.cnblogs.com/barneywill/p/15226155.html