  • Extending SparkContext with a custom textFiles method to read text files from multiple directories

    Requirement
     
    Extend SparkContext with a custom textFiles method that reads text files from several directories in a single call.
     
    Extension
     
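    import pyspark
    from pyspark.serializers import PickleSerializer
     
     
    class SparkContext(pyspark.SparkContext):
        """pyspark.SparkContext with an extra textFiles method."""
     
        def __init__(self, master=None, appName=None, sparkHome=None, pyFiles=None, environment=None, batchSize=0, serializer=PickleSerializer(), conf=None, gateway=None, jsc=None):
            pyspark.SparkContext.__init__(self, master=master, appName=appName, sparkHome=sparkHome, pyFiles=pyFiles,
                                          environment=environment, batchSize=batchSize, serializer=serializer, conf=conf, gateway=gateway, jsc=jsc)
     
        def textFiles(self, dirs):
            """Return an RDD of the text lines found under every directory in dirs."""
            # Input directories are passed as one comma-separated string;
            # the recursive flag makes Hadoop descend into subdirectories.
            hadoopConf = {"mapreduce.input.fileinputformat.inputdir": ",".join(
                dirs), "mapreduce.input.fileinputformat.input.dir.recursive": "true"}
     
            # TextInputFormat yields (byte offset, line) pairs.
            pair = self.hadoopRDD(inputFormatClass="org.apache.hadoop.mapred.TextInputFormat",
                                  keyClass="org.apache.hadoop.io.LongWritable", valueClass="org.apache.hadoop.io.Text", conf=hadoopConf)
     
            # Keep only the line text and drop the offset key.
            text = pair.map(lambda kv: kv[1])
     
            return text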
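     
    The Hadoop configuration that textFiles builds can also be factored into a plain function, which makes it easy to check without a cluster. The following is only a minimal sketch: the name build_input_conf and its recursive parameter are hypothetical and are not part of the original code or of pyspark.

```python
def build_input_conf(dirs, recursive=True):
    """Build the Hadoop conf dict that textFiles passes to hadoopRDD.

    Hypothetical helper for illustration; not part of pyspark.
    """
    dirs = list(dirs)
    if not dirs:
        raise ValueError("at least one input directory is required")
    return {
        # Hadoop takes multiple input paths as one comma-separated string.
        "mapreduce.input.fileinputformat.inputdir": ",".join(dirs),
        # Hadoop conf values are strings, so the flag is "true" or "false".
        "mapreduce.input.fileinputformat.input.dir.recursive":
            "true" if recursive else "false",
    }


example_conf = build_input_conf([
    "hdfs://dip.cdh5.dev:8020/user/yurun/dir1",
    "hdfs://dip.cdh5.dev:8020/user/yurun/dir2",
])
```

    Inside textFiles, the returned dict would take the place of the inline hadoopConf literal.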
     
    Example
     
    from pyspark import SparkConf
    from dip.spark import SparkContext
     
    conf = SparkConf().setAppName("spark_textFiles_test")
     
    sc = SparkContext(conf=conf)
     
    dirs = ["hdfs://dip.cdh5.dev:8020/user/yurun/dir1",
            "hdfs://dip.cdh5.dev:8020/user/yurun/dir2"]
     
     
    def printLines(lines):
        if lines:
            for line in lines:
                print(line)
     
    lines = sc.textFiles(dirs).collect()
     
    printLines(lines)
     
    sc.stop()
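     
    For comparison, the stock SparkContext.textFile method also accepts several paths when they are joined with commas, which may be enough when recursive directory traversal is not needed. A minimal sketch under that assumption: the function name text_files_builtin is hypothetical, sc is assumed to be an already created SparkContext, and this variant does not set the recursive flag used by textFiles.

```python
def text_files_builtin(sc, dirs):
    """Read several directories with the stock SparkContext.textFile.

    `sc` is an existing pyspark.SparkContext; the function name is
    hypothetical and used only for illustration.
    """
    # textFile hands its path string to Hadoop, which splits it on commas.
    return sc.textFile(",".join(dirs))
```

    It would be called as text_files_builtin(sc, dirs) with the same dirs list as in the example.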
     
  • Original post: https://www.cnblogs.com/yurunmiao/p/4893946.html