  • Spark Study Notes

    1. Import and configure Spark

    from pyspark import SparkConf, SparkContext
    
    conf = SparkConf().setMaster("local").setAppName("My App")
    sc = SparkContext(conf=conf)

    2. Create an RDD

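      # First way: parallelize an existing list
      A = [1,2,3,4,5]
      lines = sc.parallelize(A)
      # Another way: pass the list literal directly
      lines = sc.parallelize([1,2,3,4,5])
      # Third way: read a text file, one element per line
      lines = sc.textFile("Demo.txt")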

    3. Text file used for practice

    There were a sensitivity and a beauty to her that have nothing to do with looks. 
    She was one to be listened to, whose words were so easy to take to heart.
    It is said that the true nature of being is veiled. 
    The labor of words, the expression of art
    I used to find notes left in the collection basket,
    beautiful notes about my homilies and about the writer's thoughts on the daily scriptural readings. 
    It was a long time before I met the author of the notes.
    One Sunday morning, I was told that someone was waiting for me in the office. 
    We chatted for a while that Sunday morning and agreed to meet for lunch later that week.
    As it turned out we went to lunch several times, and she always wore a hat during the meal.
    We spoke of authors we both had read, and it was easy to tell that books are a great love of hers.
    I have thought about her often over the years and how she struggled in a society that places an incredible premium on looks
    Would her life have been different had she been pretty? Chances are it would have.
    How long does it take most of us to reach that level of human growth, if we ever get there? We get so consumed and diminished,
    The truth of her life was a desire to see beyond the surface for a glimpse of what it is that matters. 
    She found beauty and grace and they befriended her, and showed her what is real
    wnagnan is good
    huxue is beautiful
    we are good

    4. Create a pair RDD in Python with map(), using the first word of each line as the key

    pairs = lines.map(lambda x: (x.split(" ")[0], x))
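
    To see what this lambda produces for a single line without starting Spark, the same function can be applied with Python's built-in map(). This is only a local sketch: make_pair and pairs_local are names introduced here, and the three sample lines are the last three lines of the practice text.

```python
# Plain-Python sketch (no Spark): apply the same key-extraction logic locally.
# make_pair is a hypothetical helper; in Spark the lambda is passed to lines.map().
sample_lines = [
    "wnagnan is good",
    "huxue is beautiful",
    "we are good",
]

def make_pair(line):
    # The first space-separated word becomes the key; the whole line is the value.
    return (line.split(" ")[0], line)

pairs_local = list(map(make_pair, sample_lines))
for pair in pairs_local:
    print(pair)
```

    Note that an empty line would get the empty string as its key, because "".split(" ") returns [""].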

    5. Print the results with print

    pairs.foreach(print)
    ('There', 'There were a sensitivity and a beauty to her that have nothing to do with looks. ')
    ('She', 'She was one to be listened to, whose words were so easy to take to heart.')
    ('It', 'It is said that the true nature of being is veiled. ')
    ('The', 'The labor of words, the expression of art')
    ('I', 'I used to find notes left in the collection basket,')
    ('beautiful', "beautiful notes about my homilies and about the writer's thoughts on the daily scriptural readings. ")
    ('It', 'It was a long time before I met the author of the notes.')
    ('One', 'One Sunday morning, I was told that someone was waiting for me in the office. ')
    ('We', 'We chatted for a while that Sunday morning and agreed to meet for lunch later that week.')
    ('As', 'As it turned out we went to lunch several times, and she always wore a hat during the meal.')
    ('We', 'We spoke of authors we both had read, and it was easy to tell that books are a great love of hers.')
    ('I', 'I have thought about her often over the years and how she struggled in a society that places an incredible premium on looks')
    ('Would', 'Would her life have been different had she been pretty? Chances are it would have.')
    ('How', 'How long does it take most of us to reach that level of human growth, if we ever get there? We get so consumed and diminished,')
    ('The', 'The truth of her life was a desire to see beyond the surface for a glimpse of what it is that matters. ')
    ('She', 'She found beauty and grace and they befriended her, and showed her what is real')
    ('wnagnan', 'wnagnan is good')
    ('huxue', 'huxue is beautiful')
    ('we', 'we are good')

    Transformations -------------------------->

    6. Filter on the second element in Python

    result = pairs.filter(lambda keyValue: len(keyValue[1]) < 20)
    result.foreach(print)
    ('wnagnan', 'wnagnan is good')
    ('huxue', 'huxue is beautiful')
    ('we', 'we are good')

    7. Word count in Python

    words = lines.flatMap(lambda x: x.split(" "))
    result1 = words.map(lambda x: (x, 1)).reduceByKey(lambda x, y: x+y)
    result1.foreach(print)
    ('There', 1)
    ('were', 2)
    ('a', 9)
    ('sensitivity', 1)
    ('and', 10)
    ('beauty', 2)
    ('to', 11)
    ('her', 5)
    ('that', 9)
    ('have', 3)
    ('nothing', 1)
     ......
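
    The three steps (flatMap, map, reduceByKey) can be followed on a small input in plain Python. This is a local emulation of the idea, not Spark code; the variable names are introduced here for illustration.

```python
from collections import defaultdict

# Plain-Python emulation of the word-count pipeline (no Spark involved).
sample_lines = ["wnagnan is good", "huxue is beautiful", "we are good"]

# flatMap: split every line and flatten the pieces into one list of words.
words_local = [word for line in sample_lines for word in line.split(" ")]

# map: pair each word with the count 1.
ones = [(word, 1) for word in words_local]

# reduceByKey: merge the values of identical keys with x + y.
counts = defaultdict(int)
for word, n in ones:
    counts[word] = counts[word] + n

print(dict(counts))
```

    Because the split is on a single space, a line that ends with a trailing space contributes an empty-string "word"; line.split() without an argument would avoid that.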

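    8.1. Compute the average value per key in Python with reduceByKey() and mapValues()

    mapValues(function) applies the function to each value and leaves the key unchanged; each original key, together with its new value, forms an element of the new RDD.

    reduceByKey(func) is a transformation on pair RDDs that merges the values sharing the same key.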

    means = result1.mapValues(lambda x: (x, 1)).reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))
    means.foreach(print)
    ('There', (1, 1))
    ('were', (2, 1))
    ('a', (9, 1))
    ('sensitivity', (1, 1))
    ('and', (10, 1))
    ('beauty', (2, 1))
    ('to', (11, 1))
    ('her', (5, 1))
    ('that', (9, 1))
    ('have', (3, 1))
    ('nothing', (1, 1))
    ......
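
    Because result1 already holds exactly one entry per word, every count in the output above is 1. The (sum, count) pattern becomes more interesting when keys repeat. The sketch below emulates the same two steps in plain Python on made-up data; scores and the helper merge are assumptions introduced for illustration.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical (key, value) data with repeated keys; not taken from the notes.
scores = [("panda", 0), ("pink", 3), ("pirate", 3), ("panda", 1), ("pink", 4)]

# mapValues(lambda x: (x, 1)): keep the key, turn each value into (sum, count).
with_counts = [(key, (value, 1)) for key, value in scores]

# reduceByKey: fold together all (sum, count) tuples that share a key.
grouped = defaultdict(list)
for key, sum_count in with_counts:
    grouped[key].append(sum_count)

def merge(x, y):
    # Same logic as lambda x, y: (x[0] + y[0], x[1] + y[1])
    return (x[0] + y[0], x[1] + y[1])

sum_counts = {key: reduce(merge, values) for key, values in grouped.items()}

# Dividing sum by count gives the per-key average.
averages = {key: total / count for key, (total, count) in sum_counts.items()}
print(sum_counts)
print(averages)
```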

    8.2. Compute the average value per key in Python with combineByKey()

    For background on combineByKey, see https://www.cnblogs.com/rigid/p/5563205.html

    https://www.zhihu.com/question/33798481

    sumCount = result1.combineByKey((lambda x: (x, 1)), (lambda x, y: (x[0] + y, x[1] + 1)), (lambda x, y: (x[0] + y[0], x[1] + y[1])))
    sumCount.foreach(print)
    ('There', (1, 1))
    ('were', (2, 1))
    ('a', (9, 1))
    ('sensitivity', (1, 1))
    ('and', (10, 1))
    ('beauty', (2, 1))
    ('to', (11, 1))
    ('her', (5, 1))
    ('that', (9, 1))
    ('have', (3, 1))
    ('nothing', (1, 1))
    ......
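
    The roles of the three functions passed to combineByKey() are easier to see when the partitions are made explicit. The following is a minimal plain-Python emulation under that assumption; combine_by_key and the two-partition sample data are hypothetical and are not part of the Spark API.

```python
def combine_by_key(partitions, create_combiner, merge_value, merge_combiners):
    """Emulate combineByKey over a list of partitions of (key, value) pairs."""
    per_partition = []
    for partition in partitions:
        combiners = {}
        for key, value in partition:
            if key not in combiners:
                # First time this key is seen in this partition.
                combiners[key] = create_combiner(value)
            else:
                # Key already seen in this partition: fold the value in.
                combiners[key] = merge_value(combiners[key], value)
        per_partition.append(combiners)

    # Merge the per-partition results for each key.
    merged = {}
    for combiners in per_partition:
        for key, combiner in combiners.items():
            if key not in merged:
                merged[key] = combiner
            else:
                merged[key] = merge_combiners(merged[key], combiner)
    return merged

# Hypothetical data split into two partitions.
partitions = [
    [("a", 3), ("b", 4), ("a", 1)],
    [("a", 5), ("b", 2)],
]
sum_count = combine_by_key(
    partitions,
    lambda x: (x, 1),                         # createCombiner
    lambda acc, y: (acc[0] + y, acc[1] + 1),  # mergeValue
    lambda a, b: (a[0] + b[0], a[1] + b[1]),  # mergeCombiners
)
print(sum_count)
```

    Here createCombiner and mergeValue only ever see values from a single partition, while mergeCombiners joins the partial (sum, count) tuples produced by different partitions.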

    8.3. Compute the averages

    Method 1:
    avg = sumCount.map(lambda keyxy: (keyxy[0], keyxy[1][0] / keyxy[1][1])).collectAsMap()
    print(avg["There"])
    1.0

    Method 2:
    avg = sumCount.map(lambda keyxy: (keyxy[0], keyxy[1][0] / keyxy[1][1]))
    print(avg.first())
    print(avg.getNumPartitions())
    ('There', 1.0)
    1

    9. Set a custom level of parallelism for reduceByKey() in Python

    data = [("a", 3), ("b", 4), ("a", 1)]
    sc.parallelize(data).reduceByKey(lambda x, y: x + y)  # default parallelism
    sc.parallelize(data).reduceByKey(lambda x, y: x + y, 10)  # custom parallelism: 10 partitions
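
    The second argument to reduceByKey() sets the number of partitions of the resulting RDD. The idea is that all pairs with the same key are sent to the same partition, chosen by hashing the key, and are reduced there. The sketch below emulates this in plain Python; partition_of is a hypothetical helper, and zlib.crc32 is only a stand-in for the hash function Spark actually uses.

```python
import zlib

def partition_of(key, num_partitions):
    # Stand-in hash for illustration only; Spark uses its own hash function.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

data = [("a", 3), ("b", 4), ("a", 1)]
num_partitions = 10

# Route every pair to a partition by key, then reduce with x + y inside it.
buckets = {}
for key, value in data:
    index = partition_of(key, num_partitions)
    bucket = buckets.setdefault(index, {})
    bucket[key] = bucket.get(key, 0) + value

print(buckets)
```

    With three records and ten partitions most partitions stay empty, which is why a very high parallelism is not useful for tiny data sets.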

    10. Custom sort in Python that orders integers as if they were strings

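    rdd = sc.parallelize(data)
    sort_data = rdd.sortByKey(ascending=True, numPartitions=None, keyfunc=lambda x: str(x))
    # print_rdd is not defined in these notes; collect to the driver and print each pair instead
    for pair in sort_data.collect():
        print(pair)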
    ('a', 3)
    ('a', 1)
    ('b', 4)
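
    The data in this section has string keys, so converting each key with str() does not change the order. The effect of such a key function only shows up with integer keys, which can be checked locally with Python's sorted(); the list numbers below is made up for illustration.

```python
# Plain-Python illustration of sorting integers by their string form.
numbers = [1, 2, 3, 10, 20, 100]

as_integers = sorted(numbers)                         # numeric order
as_strings = sorted(numbers, key=lambda x: str(x))    # lexicographic order

print(as_integers)  # [1, 2, 3, 10, 20, 100]
print(as_strings)   # [1, 10, 100, 2, 20, 3]
```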

    11. Reading and saving data

    # inputFile, outputFile, result and pandaLovers are assumed to be defined elsewhere
    # Read a text file
    input = sc.textFile("path/to/file")
    # Save as a text file
    result.saveAsTextFile(outputFile)

    # Read CSV with textFile (one record per line)
    import csv
    import io  # Python 3: io.StringIO replaces the Python 2 StringIO module

    def loadRecord(line):
        """Parse one line of CSV."""
        input = io.StringIO(line)
        reader = csv.DictReader(input, fieldnames=["name", "favouriteAnimal"])
        return next(reader)
    input = sc.textFile(inputFile).map(loadRecord)

    # Read whole CSV files
    def loadRecords(fileNameContents):
        """Load all the records in a given file."""
        # wholeTextFiles yields (file name, file contents) pairs
        input = io.StringIO(fileNameContents[1])
        reader = csv.DictReader(input, fieldnames=["name", "favouriteAnimal"])
        return reader
    fullFileData = sc.wholeTextFiles(inputFile).flatMap(loadRecords)

    # Save CSV
    def writeRecords(records):
        """Write out some CSV records."""
        output = io.StringIO()
        writer = csv.DictWriter(output, fieldnames=["name", "favouriteAnimal"])
        for record in records:
            writer.writerow(record)
        return [output.getvalue()]
    pandaLovers.mapPartitions(writeRecords).saveAsTextFile(outputFile)
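
    The per-line parsing and per-partition writing steps can be tried without Spark, using only the csv module and io.StringIO (Python 3). In this sketch the two sample rows and the snake_case function names are assumptions made for illustration; list comprehensions stand in for map() and mapPartitions().

```python
import csv
import io

FIELDNAMES = ["name", "favouriteAnimal"]

def load_record(line):
    """Parse one CSV line into a dict keyed by FIELDNAMES."""
    reader = csv.DictReader(io.StringIO(line), fieldnames=FIELDNAMES)
    return next(reader)

def write_records(records):
    """Serialize an iterable of dicts into a single CSV string."""
    output = io.StringIO()
    writer = csv.DictWriter(output, fieldnames=FIELDNAMES)
    for record in records:
        writer.writerow(record)
    return [output.getvalue()]

# Hypothetical sample rows, standing in for the lines of an input file.
csv_lines = ["holden,panda", "sparky,bear"]

records = [load_record(line) for line in csv_lines]
panda_lovers = [r for r in records if r["favouriteAnimal"] == "panda"]
written = write_records(panda_lovers)

print(records)
print(written)
```

    Reading uses csv.DictReader and writing uses csv.DictWriter; both take the column names through the fieldnames parameter.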

    12. Accumulators: counting blank lines in Python

    file = sc.textFile("Demo.txt")
    # Create an Accumulator[int] initialized to 0
    blankLines = sc.accumulator(0)

    def extractCallSigns(line):
        global blankLines  # access the global variable
        if (line == ""):
            blankLines += 1
        return line.split(" ")

    callSigns = file.flatMap(extractCallSigns)
    # The accumulator is only updated once an action such as saveAsTextFile runs
    callSigns.saveAsTextFile("spark_output/callSigns")
    print("\n")
    print("Blank Lines:%d " % blankLines.value)
    print("\n")

     Blank Lines:3
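
    Spark transformations such as flatMap() are lazy, so the accumulator is only updated when an action (here saveAsTextFile) forces the computation. The same behaviour can be imitated in plain Python with a generator. This is only an analogy: BlankLineCounter is a hypothetical stand-in for the accumulator, and the sample lines are made up.

```python
class BlankLineCounter:
    """Hypothetical stand-in for a Spark accumulator: an add-only counter."""
    def __init__(self):
        self.value = 0

    def add(self, n):
        self.value += n

blank_lines = BlankLineCounter()

def extract_words(line):
    if line == "":
        blank_lines.add(1)
    return line.split(" ")

# Made-up input containing three blank lines.
sample_lines = ["we are good", "", "huxue is beautiful", "", ""]

# A generator is lazy, like an RDD transformation: nothing has run yet.
lazy_words = (word for line in sample_lines for word in extract_words(line))
before = blank_lines.value

# Consuming the generator plays the role of an action.
words = list(lazy_words)
after = blank_lines.value

print("Blank lines before:", before)
print("Blank lines after:", after)
```

    Reading the counter before the pipeline has been consumed gives 0, which mirrors reading blankLines.value before any action has run.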


  • Original post: https://www.cnblogs.com/rnanprince/p/10900260.html