  • Notes on a data inconsistency when querying through newAPIHadoopRDD

    Symptom:

    +----------+-------+--------+-----+-----+-----+----+----+------+---------+-------+--------+--------+------------+
    |totalCount|January|February|March|April| May|June|July|August|September|October|November|December|totalMileage|
    +----------+-------+--------+-----+-----+-----+----+----+------+---------+-------+--------+--------+------------+
    |     33808|      0|       0|    0|    0|33798|   0|   0|     0|        0|      0|       0|       0|     79995.0|
    +----------+-------+--------+-----+-----+-----+----+----+------+---------+-------+--------+--------+------------+

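    The table is currently pre-split into 10 regions.

    Going by this month's data, the test table holds 33798 rows in total.

    The row count in HBase itself is also 33798.

    The strange part: querying HBase through Spark SQL returns a count of 33808.

    The SQL statement at the time was: select count(1) from orderData

    This was puzzling, because the SQL query reported 10 more rows than actually exist.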

    ============================================================

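    Cause:

    The code set HBase's SCAN_BATCHSIZE, which sets the batch size of the scan. The documentation for this setting says:

    Set the maximum number of values to return for each call to next()

    I had always assumed this controlled how many rows are read per call. In fact, "values" appears to mean columns (cells), and enabling this setting lets an HBase scan return partial results for a row.

    After commenting out this setting, the program ran correctly.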
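
    To make the row/cell distinction concrete, here is a minimal sketch in plain Scala with no HBase dependency. It only models the documented meaning of the batch size (at most N cells per returned Result); ModelRow, ModelResult and BatchModel are hypothetical names invented for this illustration and are not HBase classes.

```scala
// Hypothetical, simplified model: a row is a key plus its cells (column -> value).
final case class ModelRow(key: String, cells: Seq[(String, String)])

// Hypothetical stand-in for one Result handed back by a scanner.
final case class ModelResult(key: String, cells: Seq[(String, String)])

object BatchModel {
  // batch <= 0 models "batch not set": every row comes back as a single Result.
  // batch > 0 models a batch size of n: at most n cells per Result, so a wide
  // row is split into several Results that all share the same row key.
  def scan(rows: Seq[ModelRow], batch: Int): Seq[ModelResult] =
    rows.flatMap { row =>
      if (batch <= 0 || row.cells.isEmpty) Seq(ModelResult(row.key, row.cells))
      else row.cells.grouped(batch).map(chunk => ModelResult(row.key, chunk)).toSeq
    }
}

object BatchModelDemo extends App {
  val rows = Seq(
    ModelRow("r1", Seq("a" -> "1", "b" -> "2", "c" -> "3")),
    ModelRow("r2", Seq("a" -> "4"))
  )

  val whole  = BatchModel.scan(rows, batch = -1)
  val pieces = BatchModel.scan(rows, batch = 2)

  // Counting Results is what a count(1) over the mapped RDD effectively does.
  println(s"Results without batch:  ${whole.size}")
  println(s"Results with batch = 2: ${pieces.size}")
  println(s"Distinct row keys:      ${pieces.map(_.key).distinct.size}")
}
```

    In this model, any code that treats each Result as one row overcounts as soon as a row is returned in more than one piece, while the number of distinct row keys stays the same.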

    Going one step further, let's look at this setting in the HBase source. The HBase Scan class has two member variables:

    • private boolean allowPartialResults = false;
    • private int batch = -1;

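    allowPartialResults is clearly the switch that allows partial results to be returned, but what about batch? Calling setBatch() does not set allowPartialResults. However, in the getResultsToAddToCache() method of ClientScanner, isBatchSet becomes true whenever the batch value is greater than 0. That is followed by this code: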

    // If the caller has indicated in their scan that they are okay with seeing partial results,
    // then simply add all results to the list. Note that since scan batching also returns results
    // for a row in pieces we treat batch being set as equivalent to allowing partials. The
    // implication of treating batching as equivalent to partial results is that it is possible
    // the caller will receive a result back where the number of cells in the result is less than
    // the batch size even though it may not be the last group of cells for that row.
    if (allowPartials || isBatchSet) {
      addResultsToList(resultsToAddToCache, resultsFromServer, 0,
          (null == resultsFromServer ? 0 : resultsFromServer.length));
      return resultsToAddToCache;
    }
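
    The effect of that branch can be sketched as a toy model in plain Scala. This is not the HBase implementation: Fragment, PartialModel and the stitching logic are hypothetical simplifications, written only to show why a batch value that is merely set (even one as large as 10000) lets row pieces through unmerged.

```scala
// Hypothetical stand-in for a Result fragment arriving from the server.
// hasMoreCellsInRow marks a fragment that is not the last piece of its row.
final case class Fragment(rowKey: String, cells: Seq[String], hasMoreCellsInRow: Boolean)

object PartialModel {
  // Toy version of the decision: if partials are allowed, or a batch size is
  // set, hand the fragments to the caller unchanged. Otherwise merge
  // consecutive fragments of the same row into one complete result.
  def resultsForCaller(fromServer: Seq[Fragment],
                       allowPartials: Boolean,
                       batch: Int): Seq[Fragment] = {
    val isBatchSet = batch > 0
    if (allowPartials || isBatchSet) {
      fromServer
    } else {
      fromServer.foldLeft(Vector.empty[Fragment]) { (acc, f) =>
        acc.lastOption match {
          case Some(prev) if prev.hasMoreCellsInRow && prev.rowKey == f.rowKey =>
            acc.init :+ Fragment(f.rowKey, prev.cells ++ f.cells, f.hasMoreCellsInRow)
          case _ =>
            acc :+ f
        }
      }
    }
  }
}

object PartialModelDemo extends App {
  // Row r1 arrives in two pieces, row r2 arrives whole.
  val fromServer = Seq(
    Fragment("r1", Seq("a", "b"), hasMoreCellsInRow = true),
    Fragment("r1", Seq("c"), hasMoreCellsInRow = false),
    Fragment("r2", Seq("a"), hasMoreCellsInRow = false)
  )

  val withBatch    = PartialModel.resultsForCaller(fromServer, allowPartials = false, batch = 10000)
  val withoutBatch = PartialModel.resultsForCaller(fromServer, allowPartials = false, batch = -1)

  println(s"Results seen with batch set:   ${withBatch.size}")
  println(s"Results seen with batch unset: ${withoutBatch.size}")
}
```

    In the model, the pieces of r1 only reach the caller separately when the batch value is set, which mirrors the quoted comment: batching is treated as equivalent to allowing partial results.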

    The earlier, faulty code:

    TableInputFormat.SCAN_BATCHSIZE
    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.Row
    import scala.collection.mutable.ArrayBuffer

    lazy val buildScan = {

        val hbaseConf = HBaseConfiguration.create()
        hbaseConf.set("hbase.zookeeper.quorum", GlobalConfigUtils.hbaseQuorem)
        hbaseConf.set(TableInputFormat.INPUT_TABLE, hbaseTableName)
        hbaseConf.set(TableInputFormat.SCAN_COLUMNS, queryColumns)
        hbaseConf.set(TableInputFormat.SCAN_ROW_START, startRowKey)
        hbaseConf.set(TableInputFormat.SCAN_ROW_STOP, endRowKey)
        hbaseConf.set(TableInputFormat.SCAN_BATCHSIZE , "10000") // TODO: this setting causes the inconsistent query results
        hbaseConf.set(TableInputFormat.SCAN_CACHEDROWS , "10000")
        hbaseConf.set(TableInputFormat.SHUFFLE_MAPS , "1000")

        val hbaseRdd = sqlContext.sparkContext.newAPIHadoopRDD(
          hbaseConf,
          classOf[org.apache.hadoop.hbase.mapreduce.TableInputFormat],
          classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
          classOf[org.apache.hadoop.hbase.client.Result]
        )

        // Each Result becomes exactly one Row, so a partial Result also becomes a Row
        val rs: RDD[Row] = hbaseRdd.map(tuple => tuple._2).map(result => {

          val values = new ArrayBuffer[Any]()
          hbaseTableFields.foreach { field =>
            values += Resolver.resolve(field, result)
          }
          Row.fromSeq(values.toSeq)
        })
        rs
    }

    Fix:

    Simply remove the TableInputFormat.SCAN_BATCHSIZE setting.

    Query result after removing it:

    +----------+-------+--------+-----+-----+-----+----+----+------+---------+-------+--------+--------+------------+
    |totalCount|January|February|March|April| May|June|July|August|September|October|November|December|totalMileage|
    +----------+-------+--------+-----+-----+-----+----+----+------+---------+-------+--------+--------+------------+
    |     33798|      0|       0|    0|    0|33798|   0|   0|     0|        0|      0|       0|       0|     79995.0|
    +----------+-------+--------+-----+-----+-----+----+----+------+---------+-------+--------+--------+------------+

    Problem solved~

  • Original post: https://www.cnblogs.com/niutao/p/10824749.html