  • Summary of problems encountered while operating a Hadoop cluster

    1. ZooKeeper error

    2017-12-13 16:47:55,968 [myid:] - INFO  [main-SendThread(localhost:2181):ClientCnxn$SendThread@975] - Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
    2017-12-13 16:47:55,968 [myid:] - WARN  [main-SendThread(localhost:2181):ClientCnxn$SendThread@1102] - Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
    java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)

    Cause: the ZooKeeper node was down. Starting it again fixes the problem.
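
    The "Connection refused" in the log suggests that nothing is listening on the client port. A minimal sketch of a port check, assuming bash (it relies on bash's /dev/tcp pseudo-device) and the client port 2181 seen in the log; the function name zk_port_open is made up for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if a TCP connection to host:port can be opened.
#   $1 = host, $2 = port (defaults to 2181, the port shown in the log)
zk_port_open() {
    local host="$1" port="${2:-2181}"
    # bash treats /dev/tcp/HOST/PORT as a TCP connection; the subshell
    # closes the descriptor again as soon as it exits.
    (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null
}
```

    For example, zk_port_open localhost 2181 || echo "ZooKeeper is not reachable" prints the message while the node is still down.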

    2. Kafka consumer error: Job aborted due to stage failure: kafka.common.OffsetOutOfRangeException

    Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): kafka.common.OffsetOutOfRangeException
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

    The Kafka message retention time is log.retention.hours=168.

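    Solution: the cause is that the offset the consumer group wants to read from is older than the earliest message Kafka still stores. The blog posts listed in the references explain this in more detail.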
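
    The condition behind this exception can be written down as a small range check. The sketch below is illustrative only: offset_in_range and clamp_offset are made-up names, and the three numbers are assumed to have been collected beforehand (the committed offset from ZooKeeper, the earliest and latest offsets from the broker).

```shell
# Hypothetical helper: succeed if the committed offset lies within
# [earliest, latest], i.e. the broker can still serve it.
offset_in_range() {
    local committed="$1" earliest="$2" latest="$3"
    [ "$committed" -ge "$earliest" ] && [ "$committed" -le "$latest" ]
}

# Hypothetical helper: print the nearest offset the broker can still serve.
clamp_offset() {
    local committed="$1" earliest="$2" latest="$3"
    if [ "$committed" -lt "$earliest" ]; then
        echo "$earliest"      # messages were already deleted by retention
    elif [ "$committed" -gt "$latest" ]; then
        echo "$latest"        # offset points past the end of the log
    else
        echo "$committed"     # already valid, leave unchanged
    fi
}
```

    With a committed offset that is older than the earliest stored message, clamp_offset returns the earliest offset, which is the value the steps in this section write back to ZooKeeper.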

    Get the minimum (earliest) offset of the topic mysqlslowlog:

    ./kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list=node:9092 --topic topic_name --time -2

    Get the maximum (latest) offset of the topic mysqlslowlog:

    ./kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list=node:9092 --topic topic_name --time -1
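
    GetOffsetShell prints one line per partition in the form topic:partition:offset. A small sketch for turning that output into "partition offset" pairs, assuming that output format; parse_offsets is a made-up name:

```shell
# Hypothetical helper: read GetOffsetShell output on stdin and print
# "partition offset" pairs, sorted by partition number.
# Lines without a numeric offset (for example log noise) are skipped.
parse_offsets() {
    awk -F: 'NF >= 3 && $3 ~ /^[0-9]+$/ { print $2, $3 }' | sort -n
}
```

    Usage: ./kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list=node:9092 --topic topic_name --time -2 | parse_offsets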

    Update the topic partition's offset in ZooKeeper:

    # Read the offset currently stored for partition 0

    get /rootdir/consumers/[consumer_group]/offsets/mysqlslowlog/0

    # Set the offset of partition 0 to the minimum value

    set /rootdir/consumers/[consumer_group]/offsets/mysqlslowlog/0 3546232
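
    With several partitions, typing each set command by hand gets tedious. The sketch below only generates the commands so they can be reviewed before being pasted into the ZooKeeper client; it assumes "partition offset" pairs on stdin, and gen_zk_set and its argument are illustrative names.

```shell
# Hypothetical helper: print one ZooKeeper "set" command per partition.
#   $1    = offsets path of the consumer group and topic, e.g.
#           /rootdir/consumers/my_group/offsets/mysqlslowlog
#   stdin = lines of "partition offset"
gen_zk_set() {
    local prefix="$1" partition offset
    while read -r partition offset; do
        [ -n "$offset" ] || continue          # skip blank or malformed lines
        printf 'set %s/%s %s\n' "$prefix" "$partition" "$offset"
    done
}
```

    Nothing is written to ZooKeeper by this function; it prints text only.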

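    Alternatively, the following command resets all partitions to the minimum value in one go (the tool also expects a consumer properties file and the topic name as arguments):

    ./kafka-run-class.sh kafka.tools.UpdateOffsetsInZK earliest consumer.properties topic_name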

    References:

    http://blog.csdn.net/xueba207/article/details/51135423
    http://blog.csdn.net/xueba207/article/details/51174818

    3. Error when restarting an HBase RegionServer node:

    Server ...,1514436003346 has been rejected; Reported time is too far out of sync with master.  Time difference of 136758ms > max allowed of 30000ms

    This is usually caused by the clocks of the HMaster node and the RegionServer node being out of sync. Synchronize the time and restart the node.
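
    The rejection in the log is a plain comparison between the measured clock difference and the allowed maximum. A sketch of the same check, with the 30000 ms limit taken from the log message; skew_ms and skew_too_large are made-up names, and how the two timestamps are collected from the nodes is left open:

```shell
# Hypothetical helper: absolute difference between two epoch timestamps in ms.
skew_ms() {
    local a="$1" b="$2" d
    d=$(( a - b ))
    if [ "$d" -lt 0 ]; then d=$(( -d )); fi
    echo "$d"
}

# Hypothetical helper: succeed if the two clocks differ by more than the limit.
#   $1, $2 = epoch timestamps in ms, $3 = limit in ms (default 30000, from the log)
skew_too_large() {
    local max_ms="${3:-30000}"
    [ "$(skew_ms "$1" "$2")" -gt "$max_ms" ]
}
```

    In HBase the limit is controlled by the hbase.master.maxclockskew setting. Fixing the clocks is the safer remedy than raising it.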

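    4. When decommissioning an HDFS DataNode, the DataNode stays in the Decommission In Progress state

    Check in the web UI:

    # Blocks with fewer replicas than required
    Under replicated blocks :2979
    # Blocks with no live replicas
    Blocks with no live replicas: 0
    # Under-replicated blocks in files that are still being written
    Under Replicated Blocks In files under construction:1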

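    Alternatively, check the DataNode status with the ./bin/hdfs dfsadmin -report command.

    The replication factor is 2. The Under replicated blocks count keeps dropping, and once it reaches 0 the node should be fully decommissioned.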
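
    Watching the counter by hand can be replaced by a small filter over the report output. A sketch, assuming the report contains a line of the form "Under replicated blocks: N" as in the web UI excerpt; under_replicated is a made-up name:

```shell
# Hypothetical helper: read "hdfs dfsadmin -report" output on stdin and
# print the first "Under replicated blocks" count found.
under_replicated() {
    sed -n 's/^[[:space:]]*Under replicated blocks[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}
```

    A polling loop could then look like: while [ "$(./bin/hdfs dfsadmin -report | under_replicated)" != "0" ]; do sleep 60; done. Note that this loop also keeps waiting if the report command fails and prints nothing.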

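    In addition, since another DataNode in the same rack usually holds a replica as well, the DataNode can be taken offline more quickly by lowering the replication factor.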

    # Check the cluster state

    ./bin/hadoop fsck / -blocks -locations -files

    # Change the replication factor (only do this when Blocks with no live replicas is 0)

    ./bin/hadoop fs -setrep -R 1 /

    # Stop the DataNode

    ./sbin/hadoop-daemon.sh stop datanode

    # Remove the node from the slaves list and the rack list

    # Run refreshNodes, or restart the NameNodes one at a time

    ./bin/hdfs dfsadmin -refreshNodes
    ./bin/yarn rmadmin -refreshNodes
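
    The step of removing the node from the slaves list can be sketched as a filter that prints the list without one host. This assumes a list file with one hostname per line; without_host is a made-up name, and the file path in the usage line is an assumption about the installation layout.

```shell
# Hypothetical helper: print a host list without the given host.
#   $1 = list file with one hostname per line, $2 = hostname to drop
without_host() {
    local file="$1" host="$2"
    # -x matches whole lines only, -F disables regex, -v inverts the match.
    # grep exits with 1 when no line remains; only 2 or more is a real error.
    grep -vxF -- "$host" "$file" || [ "$?" -eq 1 ]
}
```

    Usage: without_host etc/hadoop/slaves node2 > slaves.new, then review slaves.new before replacing the original file.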

    5. Error when removing an HDFS DataNode

    Failed to add xxxxxxxx:50010: You cannot have a rack and a non-rack node at the same level of the network topology.

    Solution:

    Check the rack list with ./bin/hdfs dfsadmin -printTopology.
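
    The message indicates that not all nodes sit at the same depth of the topology tree, for example some directly under /default-rack and others under a two-level path. A rough sketch for spotting that in the -printTopology output, assuming each rack is printed on a line starting with "Rack: "; rack_depths is a made-up name:

```shell
# Hypothetical helper: read "hdfs dfsadmin -printTopology" output on stdin and
# print each distinct rack depth once, together with one example rack path.
rack_depths() {
    awk '/^Rack: / {
        path = $2
        depth = gsub("/", "/", path)   # number of "/" equals the path depth
        if (!(depth in seen)) { seen[depth] = path; print depth, path }
    }'
}
```

    Usage: ./bin/hdfs dfsadmin -printTopology | rack_depths. More than one line of output means racks of different depth are mixed.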

    Refresh:

    ./bin/hdfs dfsadmin -refreshNodes
    ./bin/yarn rmadmin -refreshNodes

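    This did not help:
    (1) the web page still showed the DataNode with status dead;
    (2) the error "You cannot have a rack and a non-rack node at the same level of the network topology." was still reported.

    Restarting the NameNodes one at a time made the change take effect: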

    ./sbin/hadoop-daemon.sh stop namenode
    ./sbin/hadoop-daemon.sh start namenode

    Then check the rack information with

    ./bin/hdfs dfsadmin -printTopology

    The node that was supposed to be removed is no longer listed.
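
    This final check can also be scripted. A sketch, assuming node lines in the -printTopology output look like "IP:PORT (hostname)"; host_in_topology is a made-up name:

```shell
# Hypothetical helper: read "hdfs dfsadmin -printTopology" output on stdin and
# succeed if the given hostname or IP address is still listed.
host_in_topology() {
    awk -v h="$1" '
        !/^Rack: / && NF >= 1 {
            split($1, addr, ":")                # $1 is IP:PORT
            name = $2; gsub(/[()]/, "", name)   # $2 is (hostname), if present
            if (addr[1] == h || name == h) found = 1
        }
        END { exit !found }'
}
```

    Usage: ./bin/hdfs dfsadmin -printTopology | host_in_topology node2 && echo "node2 is still in the topology"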

  • Original article: https://www.cnblogs.com/wyett/p/8146044.html