zoukankan      html  css  js  c++  java
  • spark 使用shc 访问hbase超时问题解决办法

    spark 使用shc 访问hbase超时问题解决办法

    跑一批spark算法任务,从hbase拉数据就超过10小时,导致超时

    spark访问hbase 使用 spark sql + shc

    20/02/27 19:36:21 INFO TaskSetManager: Starting task 17.1 in stage 3.0 (TID 56, 725.slave.adh, executor 50, partition 17, RACK_LOCAL, 9698 bytes)
    20/02/27 19:36:22 WARN TaskSetManager: Lost task 21.0 in stage 3.0 (TID 24, 728.slave.adh, executor 63): org.apache.hadoop.hbase.client.ScannerTimeoutException: 6603499ms passed since the last invocation, timeout is currently set to 3600000
    	at org.apache.hadoop.hbase.client.ClientScanner.loadCache(ClientScanner.java:434)
    	at org.apache.hadoop.hbase.client.ClientScanner.next(ClientScanner.java:364)
    	at org.apache.spark.sql.execution.datasources.hbase.HBaseTableScanRDD$$anon$2.hasNext(HBaseTableScan.scala:187)
    	at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:216)
    	at scala.collection.Iterator$ConcatIterator.advance(Iterator.scala:183)
    	at scala.collection.Iterator$ConcatIterator.hasNext(Iterator.scala:195)
    	at scala.collection.Iterator$ConcatIterator.hasNext(Iterator.scala:192)
    	at org.apache.spark.sql.execution.datasources.hbase.HBaseTableScanRDD$$anon$3.hasNext(HBaseTableScan.scala:215)
    	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
    	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:148)
    	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    	at org.apache.spark.scheduler.Task.run(Task.scala:109)
    	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    	at java.lang.Thread.run(Thread.java:748)
    Caused by: org.apache.hadoop.hbase.UnknownScannerException: org.apache.hadoop.hbase.UnknownScannerException: Name: 39288877, already closed?
    	at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2128)
    	at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32205)
    	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2034)
    	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
    	at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
    	at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
    	at java.lang.Thread.run(Thread.java:745)
    
    	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    	at org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:97)
    	at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:266)
    	at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:62)
    	at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
    	at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:350)
    	at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:324)
    	at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:126)
    	at org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:64)
    	... 3 more
    

    首先查到了需要调整参数 base.client.scanner.timeout.period,项目使用 shc 不是自已维护的conf

    • 官方 readme 有相关的示例 引用了/etc/hbase/conf/hbase-site.xml

      ./bin/spark-submit --class your.application.class --master yarn-client --packages com.hortonworks:shc-core:1.1.1-2.1-s_2.11 --repositories http://repo.hortonworks.com/content/groups/public/ --jars /usr/hdp/current/phoenix-client/phoenix-server.jar --files /etc/hbase/conf/hbase-site.xml /To/your/application/jar

    更改本地/etc/hbase/conf/hbase-site.xml添加

    <property>
       <name>hbase.client.scanner.timeout.period</name>
       <value>36100000</value>
    </property>
    

    spark-submit --files /etc/hbase/conf/hbase-site.xml

    线上任务失败报错,任务无法执行,猜测是线上本身有hbase-site.xml和本地的hbase-site.xml 不一致,提交本地的hbase-site.xml文件,覆盖了原本正常的配置,导致异常

    新建文件 hbase-default.xml

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
     <property>
       <name>hbase.client.scanner.timeout.period</name>
       <value>3610000</value>
     </property>
    </configuration>
    

    后 spark-submit --files /etc/hbase/conf/hbase-default.xml

    报错

    20/02/27 22:53:40 INFO SparkContext: Successfully stopped SparkContext
    20/02/27 22:53:40 INFO ApplicationMaster: Unregistering ApplicationMaster with FAILED (diag message: User class threw exception: java.lang.RuntimeException: hbase-default.xml file seems to be for an older version of H
    Base (null), this version is 1.2.2
    	at org.apache.hadoop.hbase.HBaseConfiguration.checkDefaultsVersion(HBaseConfiguration.java:71)
    

    首先报错,原因是hbase-site.xml检查版本,hbase-default.xml版本不一致

    添加项 hbase.defaults.for.version

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
     <property>
       <name>hbase.client.scanner.timeout.period</name>
       <value>3610000</value>
     </property>
     <property>
        <name>hbase.defaults.for.version</name>
        <value>1.2.2</value>
     </property> 
    </configuration>
    

    提交任务执行正常,但并没有解决超时问题

    20/02/27 19:36:22 WARN TaskSetManager: Lost task 21.0 in stage 3.0 (TID 24, 728.slave.adh, executor 63): org.apache.hadoop.hbase.client.ScannerTimeoutException: 3803499ms passed since the last invocation, timeout is currently set to 3600000
    

    hbase-default.xml的配置根本就没有生效,比较奇怪,有检测版本的异常,则应该是加载hbase-default.xml文件,配置已经加进去了,应该是配置的优先级问题

    几种改配置的方式失败,只好看找官方支持和看代码了

    查看shc官方文档有类似的issue

    https://github.com/hortonworks-spark/shc/issues/160

    There are two ways to do this:
    (1) put your extra configurations in a file, and make the file as the value of HBaseRelation.HBASE_CONFIGFILE. Refer to here.

    (2) put your extra configurations in json format, and make the json as the value of HBaseRelation.HBASE_CONFIGURATION.

    实际上面两项和HBaseRelation.HBASE_CONFIGFILE 的类似,没有指定则用path下的hbase-site.xml。但出问题

    HBaseRelation.HBASE_CONFIGURATION.

    相关代码 在https://github.com/hortonworks-spark/shc/blob/master/core/src/main/scala/org/apache/spark/sql/execution/datasources/hbase/HBaseRelation.scala

            val hBaseConfiguration = parameters.get(HBaseRelation.HBASE_CONFIGURATION).map(
              parse(_).extract[Map[String, String]])
    
            val cFile = parameters.get(HBaseRelation.HBASE_CONFIGFILE)
            val hBaseConfigFile = {
              var confMap: Map[String, String] = Map.empty
              if (cFile.isDefined) {
                val xmlFile = XML.loadFile(cFile.get)
                (xmlFile \ "property").foreach(
                  x => { confMap += ((x  "name").text -> (x  "value").text) })
              }
              confMap
            }
    
            val conf = HBaseConfiguration.create
            hBaseConfiguration.foreach(_.foreach(e => conf.set(e._1, e._2)))
            hBaseConfigFile.foreach(e => conf.set(e._1, e._2))
            conf
    

    HBaseRelation.HBASE_CONFIGURATION 为json字符串,parse转json字符串为map,再提取extract为 k:v 结构,再conf.set

    如果设置了 HBaseRelation.HBASE_CONFIGFILE 则再设置HBaseRelation.HBASE_CONFIGFILE内的值

    试试添加HBaseRelation.HBASE_CONFIGURATION

      sparkSession.read
            .options(Map(
              HBaseTableCatalog.tableCatalog -> needReUpdateTagHbaseDesc(day),
              HBaseRelation.HBASE_CONFIGURATION -> "{"hbase.client.scanner.timeout.period": "3820000","hbase.rpc.timeout": "36000000","hbase.client.scanner.caching": "1000"}"
            ))
            .format("org.apache.spark.sql.execution.datasources.hbase")
            .load
            .select()
    

    提交任务,任务执行,跑了一晚上

    20/02/28 03:35:15 ERROR Executor: Exception in task 16.1 in stage 3.0 (TID 50)
    org.apache.hadoop.hbase.client.ScannerTimeoutException: 4092211ms passed since the last invocation, timeout is currently set to 3820000

    3820000 虽然报错,但这个参数是终于生效了,问题解决


    后来又找到了新的解决办法,之前的改配置的方法也没错,但没改完

    env 查看spark相关环境变量

    SPARK_CONF_DIR=/etc/spark

    /etc/spark 下有hbase-site.xml文件,更改/etc/spark/hbase-site.xml 内的hbase.client.scanner.timeout.period值即可

    那时server端的 hadoop,hbase的维护的比较多,思维惯性了,直接先改了hadoop,hbase的配置,忽视了spark env client也可能有配置要更改

    /opt/hbase-1.0.0-cdh5.5.1/conf/hbase-site.xml
    /opt/hadoop-2.3.0-cdh5.1.2/etc/hadoop/hbase-site.xml

    这两个hbase-site.xml配置,但spark实际加载的优先级/etc/spark/hbase-site.xml 更高,导致更改不生效


    但也不算是做了无用功

    /etc/spark/hbase-site.xml下的配置是全局的,默认提交的所有spark任务生效

    而代码里设置的 HBaseRelation.HBASE_CONFIGURATION,只对项目内部生效,控制粒度更细

  • 相关阅读:
    Codeforces Round #622 (Div. 2)
    Knapsack Cryptosystem 牛客团队赛
    HDU 2586(LCA欧拉序和st表)
    P3865 【模板】ST表
    P2023 [AHOI2009]维护序列 区间加乘模板
    P1558 色板游戏 线段树(区间修改,区间查询)
    Codeforces Round #621 (Div. 1 + Div. 2) D
    Codeforces Round #620 (Div. 2) E
    Educational Codeforces Round 82 (Rated for Div. 2)
    洛谷P1638 逛画展
  • 原文地址:https://www.cnblogs.com/zihunqingxin/p/14459619.html
Copyright © 2011-2022 走看看