zoukankan      html  css  js  c++  java
  • Spark操作HBase问题:java.io.IOException: Non-increasing Bloom keys

    1 问题描述

    在使用Spark BulkLoad数据到HBase时遇到以下问题:

    17/05/19 14:47:26 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 12.0 (TID 79, bydslave5, executor 3): java.io.IOException: Non-increasing Bloom keys: 80a01055HAXMTXG10100001KEY_VOLTAGE_T_C_PWR after af401055HAXMTXG10100001KEY_VOLTAGE_TEC_PWR
    	at org.apache.hadoop.hbase.regionserver.StoreFile$Writer.appendGeneralBloomfilter(StoreFile.java:911)
    	at org.apache.hadoop.hbase.regionserver.StoreFile$Writer.append(StoreFile.java:947)
    	at org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2$1.write(HFileOutputFormat2.java:199)
    	at org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2$1.write(HFileOutputFormat2.java:152)
    	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply$mcV$sp(PairRDDFunctions.scala:1125)
    	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1123)
    	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1123)
    	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
    	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1131)
    	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1102)
    	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    	at org.apache.spark.scheduler.Task.run(Task.scala:99)
    	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    	at java.lang.Thread.run(Thread.java:745)
    

    那么是在什么时候出现的呢?在运行完下面语句

    val rdd = sc.textFile("/data/produce/2015/service.log.2017-04-24-08").map(_.split("@")).map{x => (DigestUtils.md5Hex(x(0)+x(1)).substring(0,3)+x(0)+x(1),x(2))}.map{x=>{val kv:KeyValue = new KeyValue(Bytes.toBytes(x._1),Bytes.toBytes("v"),Bytes.toBytes("value"),Bytes.toBytes(x._2+""));(new ImmutableBytesWritable(kv.getKey),kv)}}
    
    rdd.saveAsNewAPIHadoopFile("/tmp/data1",classOf[ImmutableBytesWritable],classOf[KeyValue],classOf[HFileOutputFormat],job.getConfiguration())
    

    从报错信息来看是因为key没有按照递增的顺序进行排列,可能是BloomFilter对key的排序有要求,但是我们知道key的无序是因为spark在shuffle阶段并没有像MapReduce那样强制排序,所以要解决这个问题我们需要手动地为数据进行排序,只需要对rdd执行sortBy即可。

    2 问题解决

    下面语句是增加排序的语句,经过测试运行通过

    val rdd = sc.textFile("/data/produce/2015/service.log.2017-04-24-08").map(_.split("@")).map{x => (DigestUtils.md5Hex(x(0)+x(1)).substring(0,3)+x(0)+x(1),x(2))}.sortBy(x =>x._1).map{x=>{val kv:KeyValue = new KeyValue(Bytes.toBytes(x._1),Bytes.toBytes("v"),Bytes.toBytes("value"),Bytes.toBytes(x._2+""));(new ImmutableBytesWritable(kv.getKey),kv)}}
    
    rdd.saveAsNewAPIHadoopFile("/tmp/data1",classOf[ImmutableBytesWritable],classOf[KeyValue],classOf[HFileOutputFormat],job.getConfiguration())
    
  • 相关阅读:
    14. Longest Common Prefix
    7. Reverse Integer
    用例图是软件项目成本预估的好帮手
    设计模式之创建性模式
    代码的核心定义文件
    一个项目经理的经验总结
    设计模式之结构型模式
    互联网发展十几年,你错过了哪些创业机会
    产品经理必读:像怀胎一样怀产品,要厚着脸皮听批评
    陌陌估值1亿美元:一个用户10美元,贵吗?
  • 原文地址:https://www.cnblogs.com/simple-focus/p/6881807.html
Copyright © 2011-2022 走看看