Running the AAS Code - Chapter 11, Part 2


    hdfs dfs -ls /user/littlesuccess/AdvancedAnalysisWithSpark          # check the book's working directory
    hdfs dfs -mkdir /user/littlesuccess/AdvancedAnalysisWithSpark/ch11  # create the chapter 11 directory
    hdfs dfs -put fish.py /user/littlesuccess/AdvancedAnalysisWithSpark/ch11   # upload the script to HDFS

    With this preparation done, we can run the pyspark code:

    raw_data = sc.textFile('hdfs://172.31.25.243:8020/user/littlesuccess/AdvancedAnalysisWithSpark/ch11/fish.py')
    # keep only non-comment lines, then parse each remaining line as a list of floats
    data = (raw_data.filter(lambda x: not x.startswith("#"))
                    .map(lambda x: map(float, x.split(','))))
    data.take(5)

    Running it turned up an error:

    >>> data.take(5)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/opt/cloudera/parcels/CDH-5.3.3-1.cdh5.3.3.p0.5/lib/spark/python/pyspark/rdd.py", line 1081, in take
        totalParts = self._jrdd.partitions().size()
      File "/opt/cloudera/parcels/CDH-5.3.3-1.cdh5.3.3.p0.5/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
      File "/opt/cloudera/parcels/CDH-5.3.3-1.cdh5.3.3.p0.5/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
    py4j.protocol.Py4JJavaError: An error occurred while calling o31.partitions.
    : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
        at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)
        at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1713)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1322)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:3974)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:813)
        at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.getFileInfo(AuthorizationProviderProxyClientProtocol.java:502)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:815)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
        

    The cause turned out to be that my cluster has NameNode HA enabled, and the hdfs URI in my script pointed at the standby NameNode's address. Switching the URI to the active NameNode solved this problem.
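
    A more robust fix than hard-coding the active NameNode's IP is to address the HA nameservice, so the HDFS client resolves whichever NameNode is currently active on its own. A minimal sketch, assuming the nameservice defined in hdfs-site.xml is named nameservice1 (the CDH setup wizard's default; substitute your dfs.nameservices value):

    # 'nameservice1' is an assumption: use the dfs.nameservices value from your hdfs-site.xml
    raw_data = sc.textFile('hdfs://nameservice1/user/littlesuccess/AdvancedAnalysisWithSpark/ch11/fish.py')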

    Re-running the command produced another error:

    15/07/04 13:53:42 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, ip-172-31-25-244.us-west-2.compute.internal): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
      File "/opt/cloudera/parcels/CDH-5.3.3-1.cdh5.3.3.p0.5/jars/spark-assembly-1.2.0-cdh5.3.3-hadoop2.5.0-cdh5.3.3.jar/pyspark/worker.py", line 107, in main
        process()
      File "/opt/cloudera/parcels/CDH-5.3.3-1.cdh5.3.3.p0.5/jars/spark-assembly-1.2.0-cdh5.3.3-hadoop2.5.0-cdh5.3.3.jar/pyspark/worker.py", line 98, in process
        serializer.dump_stream(func(split_index, iterator), outfile)
      File "/opt/cloudera/parcels/CDH-5.3.3-1.cdh5.3.3.p0.5/jars/spark-assembly-1.2.0-cdh5.3.3-hadoop2.5.0-cdh5.3.3.jar/pyspark/serializers.py", line 227, in dump_stream
        vs = list(itertools.islice(iterator, batch))
      File "/opt/cloudera/parcels/CDH-5.3.3-1.cdh5.3.3.p0.5/lib/spark/python/pyspark/rdd.py", line 1106, in takeUpToNumLeft
        while taken < left:
    ImportError: No module named iter
    
        at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:137)
        at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:174)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
        at org.apache.spark.scheduler.Task.run(Task.scala:56)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
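
    The ImportError: No module named iter raised on the worker usually indicates that the driver and the executors are running different Python versions, so closures pickled on one side fail to deserialize on the other. A quick way to compare the two interpreters from the pyspark shell - a diagnostic sketch, not part of the original session:

    import sys
    print(sys.version)   # Python version on the driver

    # Python version on an executor: run a trivial one-partition job
    # whose task reports sys.version back from the worker
    print(sc.parallelize([0], 1).map(lambda _: __import__('sys').version).first())

    If the two versions differ, pointing both sides at the same interpreter (for example by exporting PYSPARK_PYTHON in spark-env.sh on every node) should clear the error.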