  • Spark SHC HBase timeout problem: configuring hbase.client.scanner.timeout.period

    The exception

    20/02/27 19:36:21 INFO TaskSetManager: Starting task 17.1 in stage 3.0 (TID 56, 725.slave.adh, executor 50, partition 17, RACK_LOCAL, 9698 bytes)
    20/02/27 19:36:22 WARN TaskSetManager: Lost task 21.0 in stage 3.0 (TID 24, 728.slave.adh, executor 63): org.apache.hadoop.hbase.client.ScannerTimeoutException: 6603499ms passed since the last invocation, timeout is currently set to 3600000
    	at org.apache.hadoop.hbase.client.ClientScanner.loadCache(ClientScanner.java:434)
    	at org.apache.hadoop.hbase.client.ClientScanner.next(ClientScanner.java:364)
    	at org.apache.spark.sql.execution.datasources.hbase.HBaseTableScanRDD$$anon$2.hasNext(HBaseTableScan.scala:187)
    	at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:216)
    	at scala.collection.Iterator$ConcatIterator.advance(Iterator.scala:183)
    	at scala.collection.Iterator$ConcatIterator.hasNext(Iterator.scala:195)
    	at scala.collection.Iterator$ConcatIterator.hasNext(Iterator.scala:192)
    	at org.apache.spark.sql.execution.datasources.hbase.HBaseTableScanRDD$$anon$3.hasNext(HBaseTableScan.scala:215)
    	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
    	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:148)
    	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    	at org.apache.spark.scheduler.Task.run(Task.scala:109)
    	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    	at java.lang.Thread.run(Thread.java:748)
    Caused by: org.apache.hadoop.hbase.UnknownScannerException: org.apache.hadoop.hbase.UnknownScannerException: Name: 39288877, already closed?
    	at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2128)
    	at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32205)
    	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2034)
    	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
    	at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
    	at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
    	at java.lang.Thread.run(Thread.java:745)
    
    	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    	at org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:97)
    	at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:266)
    	at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:62)
    	at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
    	at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:350)
    	at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:324)
    	at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:126)
    	at org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:64)
    	... 3 more

    ---

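    The first finding was that the parameter hbase.client.scanner.timeout.period needs to be raised. The project uses SHC, which builds its HBase configuration internally rather than taking a conf object we maintain, so the question is where to put the setting.

    Approach 1: edit the local configuration. I found two candidate config files: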
    /opt/hbase/conf/hbase-site.xml
    /opt/hadoop/etc/hadoop/hbase-site.xml

    and added

    <property>
    <name>hbase.client.scanner.timeout.period</name>
    <value>36100000</value>
    </property>

    Submitted the job; the problem remained.


    Approach 2: the official README.md has a relevant example

    https://github.com/hortonworks-spark/shc

    ./bin/spark-submit --class your.application.class --master yarn-client --packages com.hortonworks:shc-core:1.1.1-2.1-s_2.11 --repositories http://repo.hortonworks.com/content/groups/public/ --jars /usr/hdp/current/phoenix-client/phoenix-server.jar --files /etc/hbase/conf/hbase-site.xml /To/your/application/jar

    The key point is that it ships the config file with --files /etc/hbase/conf/hbase-site.xml

    I edited the local hbase-site.xml, adding

    <property>
    <name>hbase.client.scanner.timeout.period</name>
    <value>36100000</value>
    </property>

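    and then ran spark-submit --files /etc/hbase/conf/hbase-site.xml

    The online job failed with errors and could not run. My guess is that the cluster has its own hbase-site.xml that differs from the local one, so shipping the local file overrode settings that had been working and caused the failure.

    A way out would be to ask the HBase maintainers for the complete online config file, add the hbase.client.scanner.timeout.period item to it, and submit that.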
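
    A minimal sketch of that last step, assuming the complete online hbase-site.xml is already available as a string. HBaseSiteEditor and setProperty are hypothetical names introduced here for illustration; only the JDK's XML APIs are used. The property is replaced if it already exists and appended otherwise, so every other online setting is left alone.

    ```scala
    import java.io.{ByteArrayInputStream, StringWriter}
    import javax.xml.parsers.DocumentBuilderFactory
    import javax.xml.transform.{OutputKeys, TransformerFactory}
    import javax.xml.transform.dom.DOMSource
    import javax.xml.transform.stream.StreamResult
    import org.w3c.dom.{Document, Element}

    // Hypothetical helper, not part of SHC or HBase.
    object HBaseSiteEditor {
      // Text of the first descendant element with the given tag, if present.
      private def childText(parent: Element, tag: String): Option[String] =
        Option(parent.getElementsByTagName(tag).item(0)).map(_.getTextContent.trim)

      // Creates <tag>text</tag> and appends it to parent.
      private def appendChild(doc: Document, parent: Element, tag: String, text: String): Element = {
        val child = doc.createElement(tag)
        child.setTextContent(text)
        parent.appendChild(child)
        child
      }

      /** Returns the XML with property `name` set to `value` (replaced if present, appended otherwise). */
      def setProperty(xml: String, name: String, value: String): String = {
        val doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")))
        val props = doc.getElementsByTagName("property")
        // Find an existing <property> whose <name> matches.
        val existing = (0 until props.getLength)
          .map(i => props.item(i).asInstanceOf[Element])
          .find(p => childText(p, "name").contains(name))
        existing match {
          case Some(p) =>
            Option(p.getElementsByTagName("value").item(0)) match {
              case Some(v) => v.setTextContent(value)
              case None    => appendChild(doc, p, "value", value)
            }
          case None =>
            val p = doc.createElement("property")
            appendChild(doc, p, "name", name)
            appendChild(doc, p, "value", value)
            doc.getDocumentElement.appendChild(p)
        }
        // Serialize the modified document back to a string.
        val out = new StringWriter()
        val transformer = TransformerFactory.newInstance().newTransformer()
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes")
        transformer.transform(new DOMSource(doc), new StreamResult(out))
        out.toString
      }
    }
    ```

    The result would be written to a file and passed to spark-submit with --files in place of the local copy.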

    Approach 3: without the original online hbase-site.xml, try shipping an hbase-default.xml instead

    Create a new file hbase-default.xml

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
    <property>
    <name>hbase.client.scanner.timeout.period</name>
    <value>3620000</value>
    </property>
    </configuration>

    and then run spark-submit --files /etc/hbase/conf/hbase-default.xml

    This failed with
    20/02/27 22:53:40 INFO SparkContext: Successfully stopped SparkContext
    20/02/27 22:53:40 INFO ApplicationMaster: Unregistering ApplicationMaster with FAILED (diag message: User class threw exception: java.lang.RuntimeException: hbase-default.xml file seems to be for an older version of HBase (null), this version is 1.2.2
    at org.apache.hadoop.hbase.HBaseConfiguration.checkDefaultsVersion(HBaseConfiguration.java:71)

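    The job failed because HBaseConfiguration.checkDefaultsVersion checks the version recorded in hbase-default.xml against the HBase version in use, and they did not match (the new file recorded none, hence "null"). Even so, the error was encouraging: if the version is being checked, the file is being loaded.

    Add the item hbase.defaults.for.version, set to the online HBase version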

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
    <property>
    <name>hbase.client.scanner.timeout.period</name>
    <value>3620000</value>
    </property>
    <property>
    <name>hbase.defaults.for.version</name>
    <value>1.2.2</value>
    </property> 
    </configuration>

    The job was submitted and started normally,

    but the same error still appeared

    20/02/27 19:36:22 WARN TaskSetManager: Lost task 21.0 in stage 3.0 (TID 24, 728.slave.adh, executor 63): org.apache.hadoop.hbase.client.ScannerTimeoutException: 3803499ms passed since the last invocation, timeout is currently set to 3600000

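    The setting in hbase-default.xml did not take effect at all. This is odd: the version-check exception suggests that hbase-default.xml was loaded, and the setting was in it. I set this aside for now.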

    ---

    Approach 4

    From the official issue tracker:

    https://github.com/hortonworks-spark/shc/issues/160

    There are two ways to do this:
    (1) put your extra configurations in a file, and make the file as the value of HBaseRelation.HBASE_CONFIGFILE. Refer to here.

    (2) put your extra configurations in json format, and make the json as the value of HBaseRelation.HBASE_CONFIGURATION.

    If HBaseRelation.HBASE_CONFIGFILE is not specified, the configuration found on the path is used, but the attempts above to change hbase-default.xml and hbase-site.xml had all failed.


    So try HBaseRelation.HBASE_CONFIGURATION.

    Relevant code: https://github.com/hortonworks-spark/shc/blob/master/core/src/main/scala/org/apache/spark/sql/execution/datasources/hbase/HBaseRelation.scala

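    // Parse the HBASE_CONFIGURATION option (a JSON string) into key/value pairs
    val hBaseConfiguration = parameters.get(HBaseRelation.HBASE_CONFIGURATION).map(
      parse(_).extract[Map[String, String]])

    val conf = HBaseConfiguration.create
    // JSON values are set first; values from hBaseConfigFile are set afterwards and win on conflict
    hBaseConfiguration.foreach(_.foreach(e => conf.set(e._1, e._2)))
    hBaseConfigFile.foreach(e => conf.set(e._1, e._2))
    conf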

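    parse turns the JSON string into a JSON value, and extract converts that into a key-to-value map. The concern is that, judging by this code, a setting from the JSON string can be overwritten by the same setting from the config file (hbase-site.xml), and I did not know whether the online hbase-site.xml contains this item.

    Try it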
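
    The order of the conf.set calls can be modeled with plain maps. This is only a sketch: ShcConfPrecedence and effective are hypothetical names, and it assumes that hBaseConfigFile holds the key/value pairs read from the file named by HBaseRelation.HBASE_CONFIGFILE, which the quoted snippet does not show. The first argument stands for whatever HBaseConfiguration.create picks up from hbase-default.xml and hbase-site.xml on the classpath.

    ```scala
    // Hypothetical model of the merge order; not SHC code.
    object ShcConfPrecedence {
      /** Later sources overwrite earlier ones, mirroring the order of the conf.set calls. */
      def effective(classpath: Map[String, String],
                    jsonOption: Map[String, String],
                    configFile: Map[String, String]): Map[String, String] =
        classpath ++ jsonOption ++ configFile
    }
    ```

    Under this model the JSON option beats the classpath files, and loses only to a key that is also present in the explicit config file.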

    .options(Map(
      HBaseTableCatalog.tableCatalog -> catalog.catalogEsDocByFields(hTable, fields),
      HBaseRelation.HBASE_CONFIGURATION -> """{"hbase.client.scanner.timeout.period": "3820000"}"""
    ))

    Submitted the job, and it ran
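
    Hand-writing the JSON inside a Scala string literal is easy to get wrong because of the nested double quotes. One possible alternative, sketched here with a hypothetical helper (HBaseConfJson.toJson is not part of SHC), is to render the string from a Map. It handles only flat string keys and values and escapes just backslashes and double quotes, which should be enough for ordinary HBase property names and numeric values.

    ```scala
    // Hypothetical helper for building the HBASE_CONFIGURATION value.
    object HBaseConfJson {
      // Escapes the two characters that would break a JSON string literal.
      private def quote(s: String): String =
        "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\""

      /** Renders a flat string map as a JSON object. */
      def toJson(conf: Map[String, String]): String =
        conf.map { case (k, v) => s"${quote(k)}:${quote(v)}" }.mkString("{", ",", "}")
    }
    ```

    The result of HBaseConfJson.toJson(Map("hbase.client.scanner.timeout.period" -> "3820000")) could then be passed as the value of HBaseRelation.HBASE_CONFIGURATION.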

    20/02/28 03:35:15 ERROR Executor: Exception in task 16.1 in stage 3.0 (TID 50)
    org.apache.hadoop.hbase.client.ScannerTimeoutException: 4092211ms passed since the last invocation, timeout is currently set to 3820000

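    The task still failed, but the reported timeout is now 3820000, so the hbase.client.scanner.timeout.period setting finally took effect. What remains is to raise the value above the longest gap seen between scanner calls (4092211 ms in this run).

    Problem solved. One caveat: the hbase-site.xml on the path may differ between YARN clusters, so this approach does not necessarily apply everywhere.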
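
    Every attempt above was judged by the "timeout is currently set to" value in the exception message. As a small sketch, that check can be done on a log line with a regular expression; ScannerTimeoutLog and parse are hypothetical names, and the pattern is based only on the message text quoted in this post.

    ```scala
    // Hypothetical log helper; the pattern follows the message format shown above.
    object ScannerTimeoutLog {
      private val Pattern =
        """(\d+)ms passed since the last invocation, timeout is currently set to (\d+)""".r

      /** Returns (elapsedMs, timeoutMs) if the line contains a ScannerTimeoutException message. */
      def parse(line: String): Option[(Long, Long)] =
        Pattern.findFirstMatchIn(line).map(m => (m.group(1).toLong, m.group(2).toLong))
    }
    ```

    The second number shows which timeout the client actually used, and the first shows how long the gap between scanner calls really was.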

  • Original post: https://www.cnblogs.com/zihunqingxin/p/12375879.html