spark 使用shc 访问hbase超时问题解决办法
跑一批spark算法任务,从hbase拉数据就超过10小时,导致超时
spark访问hbase 使用 spark sql + shc
20/02/27 19:36:21 INFO TaskSetManager: Starting task 17.1 in stage 3.0 (TID 56, 725.slave.adh, executor 50, partition 17, RACK_LOCAL, 9698 bytes)
20/02/27 19:36:22 WARN TaskSetManager: Lost task 21.0 in stage 3.0 (TID 24, 728.slave.adh, executor 63): org.apache.hadoop.hbase.client.ScannerTimeoutException: 6603499ms passed since the last invocation, timeout is currently set to 3600000
at org.apache.hadoop.hbase.client.ClientScanner.loadCache(ClientScanner.java:434)
at org.apache.hadoop.hbase.client.ClientScanner.next(ClientScanner.java:364)
at org.apache.spark.sql.execution.datasources.hbase.HBaseTableScanRDD$$anon$2.hasNext(HBaseTableScan.scala:187)
at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:216)
at scala.collection.Iterator$ConcatIterator.advance(Iterator.scala:183)
at scala.collection.Iterator$ConcatIterator.hasNext(Iterator.scala:195)
at scala.collection.Iterator$ConcatIterator.hasNext(Iterator.scala:192)
at org.apache.spark.sql.execution.datasources.hbase.HBaseTableScanRDD$$anon$3.hasNext(HBaseTableScan.scala:215)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:148)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.hbase.UnknownScannerException: org.apache.hadoop.hbase.UnknownScannerException: Name: 39288877, already closed?
at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2128)
at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32205)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2034)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
at java.lang.Thread.run(Thread.java:745)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:97)
at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:266)
at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:62)
at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:350)
at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:324)
at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:126)
at org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:64)
... 3 more
首先查到了需要调整参数 base.client.scanner.timeout.period,项目使用 shc 不是自已维护的conf
-
官方 readme 有相关的示例 引用了
/etc/hbase/conf/hbase-site.xml
./bin/spark-submit --class your.application.class --master yarn-client --packages com.hortonworks:shc-core:1.1.1-2.1-s_2.11 --repositories http://repo.hortonworks.com/content/groups/public/ --jars /usr/hdp/current/phoenix-client/phoenix-server.jar --files /etc/hbase/conf/hbase-site.xml /To/your/application/jar
更改本地/etc/hbase/conf/hbase-site.xml
添加
<property>
<name>hbase.client.scanner.timeout.period</name>
<value>36100000</value>
</property>
spark-submit --files /etc/hbase/conf/hbase-site.xml
线上任务失败报错,任务无法执行,猜测是线上本身有hbase-site.xml和本地的hbase-site.xml 不一致,提交本地的hbase-site.xml文件,覆盖了原本正常的配置,导致异常
新建文件 hbase-default.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hbase.client.scanner.timeout.period</name>
<value>3610000</value>
</property>
</configuration>
后 spark-submit --files /etc/hbase/conf/hbase-default.xml
报错
20/02/27 22:53:40 INFO SparkContext: Successfully stopped SparkContext
20/02/27 22:53:40 INFO ApplicationMaster: Unregistering ApplicationMaster with FAILED (diag message: User class threw exception: java.lang.RuntimeException: hbase-default.xml file seems to be for an older version of H
Base (null), this version is 1.2.2
at org.apache.hadoop.hbase.HBaseConfiguration.checkDefaultsVersion(HBaseConfiguration.java:71)
首先报错,原因是hbase-site.xml检查版本,hbase-default.xml版本不一致
添加项 hbase.defaults.for.version
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hbase.client.scanner.timeout.period</name>
<value>3610000</value>
</property>
<property>
<name>hbase.defaults.for.version</name>
<value>1.2.2</value>
</property>
</configuration>
提交任务执行正常,但并没有解决超时问题
20/02/27 19:36:22 WARN TaskSetManager: Lost task 21.0 in stage 3.0 (TID 24, 728.slave.adh, executor 63): org.apache.hadoop.hbase.client.ScannerTimeoutException: 3803499ms passed since the last invocation, timeout is currently set to 3600000
hbase-default.xml的配置根本就没有生效,比较奇怪,有检测版本的异常,则应该是加载hbase-default.xml文件,配置已经加进去了,应该是配置的优先级问题
几种改配置的方式失败,只好看找官方支持和看代码了
查看shc官方文档有类似的issue
https://github.com/hortonworks-spark/shc/issues/160
There are two ways to do this:
(1) put your extra configurations in a file, and make the file as the value of HBaseRelation.HBASE_CONFIGFILE. Refer to here.
(2) put your extra configurations in json format, and make the json as the value of HBaseRelation.HBASE_CONFIGURATION.
实际上面两项和HBaseRelation.HBASE_CONFIGFILE 的类似,没有指定则用path下的hbase-site.xml。但出问题
HBaseRelation.HBASE_CONFIGURATION.
相关代码 在https://github.com/hortonworks-spark/shc/blob/master/core/src/main/scala/org/apache/spark/sql/execution/datasources/hbase/HBaseRelation.scala
val hBaseConfiguration = parameters.get(HBaseRelation.HBASE_CONFIGURATION).map(
parse(_).extract[Map[String, String]])
val cFile = parameters.get(HBaseRelation.HBASE_CONFIGFILE)
val hBaseConfigFile = {
var confMap: Map[String, String] = Map.empty
if (cFile.isDefined) {
val xmlFile = XML.loadFile(cFile.get)
(xmlFile \ "property").foreach(
x => { confMap += ((x "name").text -> (x "value").text) })
}
confMap
}
val conf = HBaseConfiguration.create
hBaseConfiguration.foreach(_.foreach(e => conf.set(e._1, e._2)))
hBaseConfigFile.foreach(e => conf.set(e._1, e._2))
conf
HBaseRelation.HBASE_CONFIGURATION 为json字符串,parse转json字符串为map,再提取extract为 k:v 结构,再conf.set
如果设置了 HBaseRelation.HBASE_CONFIGFILE 则再设置HBaseRelation.HBASE_CONFIGFILE内的值
试试添加HBaseRelation.HBASE_CONFIGURATION
sparkSession.read
.options(Map(
HBaseTableCatalog.tableCatalog -> needReUpdateTagHbaseDesc(day),
HBaseRelation.HBASE_CONFIGURATION -> "{"hbase.client.scanner.timeout.period": "3820000","hbase.rpc.timeout": "36000000","hbase.client.scanner.caching": "1000"}"
))
.format("org.apache.spark.sql.execution.datasources.hbase")
.load
.select()
提交任务,任务执行,跑了一晚上
20/02/28 03:35:15 ERROR Executor: Exception in task 16.1 in stage 3.0 (TID 50)
org.apache.hadoop.hbase.client.ScannerTimeoutException: 4092211ms passed since the last invocation, timeout is currently set to 3820000
3820000 虽然报错,但这个参数是终于生效了,问题解决
后来又找到了新的解决办法,之前的改配置的方法也没错,但没改完
env 查看spark相关环境变量
SPARK_CONF_DIR=/etc/spark
/etc/spark 下有hbase-site.xml文件,更改/etc/spark/hbase-site.xml 内的hbase.client.scanner.timeout.period
值即可
那时server端的 hadoop,hbase的维护的比较多,思维惯性了,直接先改了hadoop,hbase的配置,忽视了spark env client也可能有配置要更改
/opt/hbase-1.0.0-cdh5.5.1/conf/hbase-site.xml
/opt/hadoop-2.3.0-cdh5.1.2/etc/hadoop/hbase-site.xml
这两个hbase-site.xml配置,但spark实际加载的优先级/etc/spark/hbase-site.xml 更高,导致更改不生效
但也不算是做了无用功
/etc/spark/hbase-site.xml下的配置是全局的,默认提交的所有spark任务生效
而代码里设置的 HBaseRelation.HBASE_CONFIGURATION,只对项目内部生效,控制粒度更细