  • Integrating Tachyon with HDFS and Spark

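    Installing and testing a Tachyon 0.7.1 pseudo-distributed cluster:
    http://blog.csdn.net/stark_summer/article/details/48321605
    According to the official documentation, Spark 1.4.x is compatible with Tachyon 0.6.4, while the latest Tachyon 0.7.1 is compatible with Spark 1.5.x. This setup currently uses Spark 1.4.1 with Tachyon 0.7.1, which is not one of the documented pairings.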

    Integrating Tachyon with HDFS

    Edit tachyon-env.sh

    export TACHYON_UNDERFS_ADDRESS=hdfs://master:8020
    -Dtachyon.data.folder=$TACHYON_UNDERFS_ADDRESS/tmp/tachyon/data
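
    For reference, the sketch below shows what the data folder setting resolves to once the under-FS address is substituted. It assumes the second line is passed to the JVM as a -D system property; the variable name data_folder_opt is only illustrative and does not come from tachyon-env.sh.

```shell
# Same value as configured in tachyon-env.sh.
TACHYON_UNDERFS_ADDRESS="hdfs://master:8020"

# Assumption: the data folder is handed to the JVM as a -D system property.
# data_folder_opt is a hypothetical name used only for this illustration.
data_folder_opt="-Dtachyon.data.folder=${TACHYON_UNDERFS_ADDRESS}/tmp/tachyon/data"

# Show the fully expanded option.
echo "$data_folder_opt"
```

    With the address above, Tachyon's data folder ends up under /tmp/tachyon/data on the HDFS namenode at master:8020.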

    Upload the files to HDFS

     hadoop fs -put /home/cluster/data/test/bank/ /data/spark/
    
     hadoop fs -ls /data/spark/bank/
    Found 3 items
    -rw-r--r--   3 wangyue supergroup    4610348 2015-09-11 20:02 /data/spark/bank/bank-full.csv
    -rw-r--r--   3 wangyue supergroup       3864 2015-09-11 20:02 /data/spark/bank/bank-names.txt
    -rw-r--r--   3 wangyue supergroup     461474 2015-09-11 20:02 /data/spark/bank/bank.csv

    Read /data/spark/bank/bank-full.csv through Tachyon

    val bankFullFile = sc.textFile("tachyon://master:19998/data/spark/bank/bank-full.csv/bank-full.csv")
    2015-09-11 20:08:20,136 INFO  [main] storage.MemoryStore (Logging.scala:logInfo(59)) - ensureFreeSpace(177384) called with curMem=630803, maxMem=257918238
    2015-09-11 20:08:20,137 INFO  [main] storage.MemoryStore (Logging.scala:logInfo(59)) - Block broadcast_3 stored as values in memory (estimated size 173.2 KB, free 245.2 MB)
    2015-09-11 20:08:20,154 INFO  [main] storage.MemoryStore (Logging.scala:logInfo(59)) - ensureFreeSpace(17665) called with curMem=808187, maxMem=257918238
    2015-09-11 20:08:20,155 INFO  [main] storage.MemoryStore (Logging.scala:logInfo(59)) - Block broadcast_3_piece0 stored as bytes in memory (estimated size 17.3 KB, free 245.2 MB)
    2015-09-11 20:08:20,156 INFO  [sparkDriver-akka.actor.default-dispatcher-2] storage.BlockManagerInfo (Logging.scala:logInfo(59)) - Added broadcast_3_piece0 in memory on localhost:41040 (size: 17.3 KB, free: 245.9 MB)
    2015-09-11 20:08:20,157 INFO  [main] spark.SparkContext (Logging.scala:logInfo(59)) - Created broadcast 3 from textFile at <console>:21
    bankFullFile: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[7] at textFile at <console>:21
    

    count

    bankFullFile.count()
    However, the job reported the following warnings:
    2015-09-11 21:34:31,494 WARN  [Executor task launch worker-6]  (RemoteBlockInStream.java:retrieveByteBufferFromRemoteMachine(320)) - Read nothing
    2015-09-11 21:34:31,495 WARN  [Executor task launch worker-6]  (RemoteBlockInStream.java:retrieveByteBufferFromRemoteMachine(320)) - Read nothing
    2015-09-11 21:34:31,489 WARN  [Executor task launch worker-7]  (RemoteBlockInStream.java:retrieveByteBufferFromRemoteMachine(320)) - Read nothing
    2015-09-11 21:34:31,495 WARN  [Executor task launch worker-7]  (RemoteBlockInStream.java:retrieveByteBufferFromRemoteMachine(320)) - Read nothing
    2015-09-11 21:34:31,495 WARN  [Executor task launch worker-7]  (RemoteBlockInStream.java:retrieveByteBufferFromRemoteMachine(320)) - Read nothing
    2015-09-11 21:34:31,495 WARN  [Executor task launch worker-7]  (RemoteBlockInStream.java:retrieveByteBufferFromRemoteMachine(320)) - Read nothing
    2015-09-11 21:34:31,495 WARN  [Executor task launch worker-7]  (RemoteBlockInStream.java:retrieveByteBufferFromRemoteMachine(320)) - Read nothing
    2015-09-11 21:34:31,495 WARN  [Executor task launch worker-7]  (RemoteBlockInStream.java:retrieveByteBufferFromRemoteMachine(320)) - Read nothing
    2015-09-11 21:34:31,496 WARN  [Executor task launch worker-7]  (RemoteBlockInStream.java:retrieveByteBufferFromRemoteMachine(320)) - Read nothing
    2015-09-11 21:34:31,496 WARN  [Executor task launch worker-7]  (RemoteBlockInStream.java:retrieveByteBufferFromRemoteMachine(320)) - Read nothing
    2015-09-11 21:34:31,496 WARN  [Executor task launch worker-7]  (RemoteBlockInStream.java:retrieveByteBufferFromRemoteMachine(320)) - Read nothing
    2015-09-11 21:34:31,496 WARN  [Executor task launch worker-7]  (RemoteBlockInStream.java:retrieveByteBufferFromRemoteMachine(320)) - Read nothing
    2015-09-11 21:34:31,496 WARN  [Executor task launch worker-7]  (RemoteBlockInStream.java:retrieveByteBufferFromRemoteMachine(320)) - Read nothing
    2015-09-11 21:34:31,496 WARN  [Executor task launch worker-7]  (RemoteBlockInStream.java:retrieveByteBufferFromRemoteMachine(320)) - Read nothing
    2015-09-11 21:34:31,496 WARN  [Executor task launch worker-7]  (RemoteBlockInStream.java:retrieveByteBufferFromRemoteMachine(320)) - Read nothing
    

    The error looks strange. Does anyone know what causes it?

    However, I can see the following in the Tachyon file system:

    ./bin/tachyon tfs ls /data/spark/bank/bank-full.csv/
    4502.29 KB  09-11-2015 20:09:02:078  Not In Memory  /data/spark/bank/bank-full.csv/bank-full.csv
    

    whereas bank-full.csv appears in HDFS as:

    hadoop fs -ls /data/spark/bank/
    Found 3 items
    -rw-r--r--   3 wangyue supergroup    4610348 2015-09-11 20:02 /data/spark/bank/bank-full.csv
    -rw-r--r--   3 wangyue supergroup       3864 2015-09-11 20:02 /data/spark/bank/bank-names.txt
    -rw-r--r--   3 wangyue supergroup     461474 2015-09-11 20:02 /data/spark/bank/bank.csv
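
    The two listings seem to describe the same data: Tachyon reports 4502.29 KB and HDFS reports 4610348 bytes. A quick check of that arithmetic, assuming the Tachyon listing uses 1 KB = 1024 bytes:

```shell
# Size of bank-full.csv in bytes, taken from the `hadoop fs -ls` output.
hdfs_bytes=4610348

# Convert to KB (assuming 1 KB = 1024 bytes), rounded to two decimals.
size_kb=$(awk -v b="$hdfs_bytes" 'BEGIN { printf "%.2f", b / 1024 }')

echo "${size_kb} KB"
```

    This prints 4502.29 KB, which matches the size shown by tachyon tfs ls, so the entry under bank-full.csv/ has the same size as the file in HDFS.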
    

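    In fact, Tachyon itself appears to have loaded bank-full.csv and stored it in its own file system at tachyon://master:19998/data/spark/bank/bank-full.csv/bank-full.csv, even though the listing above reports it as Not In Memory.
    This works because conf/tachyon-env.sh sets export TACHYON_UNDERFS_ADDRESS=hdfs://master:8020, which lets tachyon://localhost:19998 reach files at the corresponding path in HDFS.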
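
    The relationship between a tachyon:// URI and the under-FS location can be sketched as a simple string mapping. This is only an illustration under the assumption that the path part of the Tachyon URI is appended unchanged to TACHYON_UNDERFS_ADDRESS; to_underfs_uri is a hypothetical helper, not a Tachyon command.

```shell
# Hypothetical helper: map a tachyon:// URI to the under-FS URI, assuming the
# path is appended unchanged to the under-FS address.
to_underfs_uri() {
  tachyon_uri="$1"
  underfs_address="$2"

  # Strip the scheme: master:19998/data/spark/...
  rest="${tachyon_uri#tachyon://}"

  # Strip host:port and keep the absolute path; default to "/" if none given.
  case "$rest" in
    */*) path="/${rest#*/}" ;;
    *)   path="/" ;;
  esac

  printf '%s%s\n' "$underfs_address" "$path"
}

to_underfs_uri "tachyon://master:19998/data/spark/bank/bank-full.csv" "hdfs://master:8020"
```

    Under that assumption, tachyon://master:19998/data/spark/bank/bank-full.csv corresponds to hdfs://master:8020/data/spark/bank/bank-full.csv, the file uploaded earlier.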

    So instead, I will first read the file directly from HDFS and then save it to Tachyon

    scala> val bankfullfile =  sc.textFile("/data/spark/bank/bank-full.csv")
    scala> bankfullfile.count
    res0: Long = 45212
    
    scala> bankfullfile.saveAsTextFile("tachyon://master:19998/data/spark/bank/newbankfullfile")

    Unfinished, to be continued.

    Copyright notice: this is an original article by the blogger and may not be reproduced without the blogger's permission.

  • Original article: https://www.cnblogs.com/stark-summer/p/4829741.html