  • Learning Spark: hello world

    Getting Started with Spark

    1. Download Spark

    http://archive.apache.org/dist/spark/spark-2.4.4/

    2. Extract spark-2.4.4-bin-hadoop2.7.tgz to /Users/yyc121/software/

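    3. Make the pyspark command available globally

      3.1 Open the shell profile: vim ~/.bash_profile

      3.2 Add: alias pyspark="/Users/yyc121/software/spark-2.4.4-bin-hadoop2.7/bin/pyspark"

    4. Verify the installation: open a new terminal (or run source ~/.bash_profile) and enter pyspark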

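    5. Copy the pyspark package into Python's site-packages directory. In the Spark distribution the package is the directory /Users/yyc121/software/spark-2.4.4-bin-hadoop2.7/python/pyspark; the pyspark file under bin/ is only the launcher script and cannot be imported.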

    6. Configure the Python interpreter and the run environment in PyCharm

      6.1 Python interpreter configuration

      6.2 Run environment configuration
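
    The run environment mainly has to let the interpreter find the pyspark package. The following is only a sketch under assumptions, not the settings used in this post: the helper names are hypothetical, SPARK_HOME is assumed to point at the extracted Spark directory, and the package and its bundled py4j zip are assumed to sit under python/ and python/lib/ of that directory, as in the standard Spark binary distributions. Rather than copying files, it puts those locations on sys.path at run time.

```python
import glob
import os
import sys


def spark_python_paths(spark_home):
    """Return the locations that need to be importable for `import pyspark`.

    Hypothetical helper: assumes the standard layout of a Spark binary
    distribution (python/ for the package, python/lib/ for the py4j zip).
    """
    python_dir = os.path.join(spark_home, "python")
    # The py4j version in the file name varies between Spark releases,
    # so it is matched with a wildcard instead of being hard-coded.
    py4j_zips = sorted(glob.glob(os.path.join(python_dir, "lib", "py4j-*-src.zip")))
    return [python_dir] + py4j_zips


def add_spark_to_sys_path(spark_home):
    """Prepend the Spark Python locations to sys.path (hypothetical helper)."""
    for path in spark_python_paths(spark_home):
        if path not in sys.path:
            sys.path.insert(0, path)


if __name__ == "__main__":
    # Falls back to the install location used in this post if SPARK_HOME is unset.
    spark_home = os.environ.get(
        "SPARK_HOME", "/Users/yyc121/software/spark-2.4.4-bin-hadoop2.7"
    )
    add_spark_to_sys_path(spark_home)
    try:
        import pyspark
        print("pyspark found at", pyspark.__file__)
    except ImportError:
        print("pyspark is not importable; check SPARK_HOME:", spark_home)
```

    The same two locations could instead be supplied as PYTHONPATH in the PyCharm run configuration, which keeps the script itself free of path handling.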

      

    7. Write a word-count script to test the setup

    helloword.py

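    # -*- coding: utf-8 -*-
    from operator import add

    from pyspark import SparkContext


    def main():
        sc = SparkContext(appName="wordsCount")
        # Relative path: in a local run it is resolved against the working directory.
        lines = sc.textFile('words.txt')
        # Split each line into words, pair every word with 1, then sum the 1s per word.
        # The outer parentheses allow the chained calls to span several lines.
        counts = (lines.flatMap(lambda x: x.split(' '))
                       .map(lambda x: (x, 1))
                       .reduceByKey(add))
        # collect() brings the (word, count) pairs back to the driver as a list.
        output = counts.collect()
        print(output)
        for (word, count) in output:
            print("%s: %i" % (word, count))

        sc.stop()


    if __name__ == "__main__":
        main()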

    words.txt

    The dynamic lifestyle
    people lead nowadays
    causes many reactions
    in our bodies and
    the one that is the
    most frequent of all
    is the headache
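
    To see what the flatMap / map / reduceByKey chain computes on this file, the same three steps can be mimicked in plain Python without Spark. This is a minimal sketch for cross-checking only: word_counts and WORDS_TXT are hypothetical names, the text is the file content above pasted into a string, and the order of the pairs returned by Spark may differ from the dictionary order here.

```python
from operator import add

# Content of words.txt from this post, pasted in as a string.
WORDS_TXT = """The dynamic lifestyle
people lead nowadays
causes many reactions
in our bodies and
the one that is the
most frequent of all
is the headache"""


def word_counts(text):
    """Plain-Python counterpart of flatMap -> map -> reduceByKey(add)."""
    lines = text.splitlines()
    # flatMap: split every line and flatten the pieces into one list of words.
    words = [word for line in lines for word in line.split(' ')]
    # map: pair each word with the number 1.
    pairs = [(word, 1) for word in words]
    # reduceByKey(add): combine the values of equal keys with add.
    counts = {}
    for word, n in pairs:
        counts[word] = add(counts[word], n) if word in counts else n
    return counts


if __name__ == "__main__":
    for word, count in word_counts(WORDS_TXT).items():
        print("%s: %i" % (word, count))
```

    Because the split is on single spaces and nothing is lower-cased, "The" and "the" are counted as different words, just as in the Spark script.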

    8. Run the script directly in PyCharm

    9. Results

  • Original article: https://www.cnblogs.com/syw-home/p/13952452.html