  • emr-spark

Check the Spark version:
    spark-sql --version


Open-source Spark tarball: download from the Apache dist site
    spark-2.4.3-bin-hadoop2.8.tgz

1/ For Spark to access S3, copy the AWS jars into Spark's jar directory:
    cp /usr/lib/hadoop-current/share/hadoop/tools/lib/*aws* /usr/lib/spark-current/jars/
Socket issue:
    wget https://repo1.maven.org/maven2/org/apache/httpcomponents/httpclient/4.3.6/httpclient-4.3.6.jar
Replace httpclient-4.5.6.jar under /usr/lib/spark-current/jars/ with httpclient-4.3.6.jar and it works.

# At first the job failed with "socket not created by this factory". My first thought was to replace the Spark package, which didn't help; replacing httpclient fixed it.

    <property>
      <name>fs.s3a.access.key</name>
      <value>access.key</value>
    </property>
    <property>
      <name>fs.s3a.secret.key</name>
      <value>secret.key</value>
    </property>
    <property>
      <name>fs.s3a.impl</name>
      <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
    </property>
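The same fs.s3a properties can also be passed programmatically when building the session: Spark forwards any property prefixed with `spark.hadoop.` into the Hadoop configuration. A minimal sketch; the app name and key values are placeholders:

```python
from pyspark.sql import SparkSession

# Equivalent of the XML properties above: the "spark.hadoop." prefix
# forwards each setting into Hadoop's Configuration object.
# Access/secret key values below are placeholders.
spark = (
    SparkSession.builder
    .appName("s3a-access")
    .config("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY")
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .getOrCreate()
)
```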

# From Zhenqian: EMR Hadoop access to S3
    sudo cp /usr/lib/hadoop-current/share/hadoop/tools/lib/*aws* /usr/lib/hadoop-current/share/hadoop/common/lib/
    sudo cp /usr/lib/hadoop-current/share/hadoop/tools/lib/*jackson* /usr/lib/hadoop-current/share/hadoop/common/lib/
    sudo cp /usr/lib/hadoop-current/share/hadoop/tools/lib/joda-time-2.9.4.jar /usr/lib/hadoop-current/share/hadoop/common/lib/


# Test examples

Example 1: test S3 access (interactive pyspark)
    pyspark --queue algo_spark

    data = spark.sql("""
    select * from oride_source.order_driver_feature_new where dt="{dt}" and hour="10"
    """.format(dt="2020-01-09"))

    data.take(2)
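The query templating used above (substituting partition values into the SQL string via `.format`) can be sketched in plain Python; the `build_query` helper and the parameterized `hour` are illustrative additions, not part of the original snippet:

```python
# Template with partition filters; dt/hour are Hive partition columns.
QUERY_TEMPLATE = (
    'select * from oride_source.order_driver_feature_new '
    'where dt="{dt}" and hour="{hour}"'
)

def build_query(dt: str, hour: str) -> str:
    """Hypothetical helper: fill the partition filters into the template."""
    return QUERY_TEMPLATE.format(dt=dt, hour=hour)

print(build_query("2020-01-09", "10"))
```

The resulting string is what gets passed to `spark.sql(...)` in the example above.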

Example 2: test S3 access (spark-submit job)
    su - hdfs
    cd /home/hdfs/bowen.wang/algo-offline-job/direct_dispatch_info
    PYTHONPATH=./ /usr/lib/spark-current/bin/spark-submit --queue algo_spark ./all_city_driver_info.py prod

Example 3: pyspark with explicit resource parameters

    pyspark --queue algo_spark --num-executors 15 --executor-cores 5 --executor-memory 10G --driver-memory 10G
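As a quick sanity check on the flags above, the total resources requested from YARN can be computed (this ignores driver memory and per-executor overhead, which YARN adds on top):

```python
# Values taken from the pyspark flags above.
num_executors = 15
executor_cores = 5
executor_memory_gb = 10

total_cores = num_executors * executor_cores                    # cores across all executors
total_executor_memory_gb = num_executors * executor_memory_gb   # executor heap memory in GB

print(total_cores, total_executor_memory_gb)
```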

Example 4: check whether a JSON field is present

    pyspark --queue data_bi --master yarn-client
    tmp = """ select d_h_id, d_h_od from oride_source.order_driver_feature_new where dt="2020-01-19" and hour="00" """
    data = spark.sql(tmp).rdd
    data.take(2)
    [Row(d_h_id=1, d_h_od=None), Row(d_h_id=2, d_h_od=None)]
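The null check that `take(2)` illustrates above can be sketched in plain Python, treating rows as dicts; the sample data mirrors the output shown, and `missing_field_count` is a hypothetical helper:

```python
# Sample rows mirroring the take(2) output above; d_h_od is None
# when the JSON field is absent for that row.
rows = [
    {"d_h_id": 1, "d_h_od": None},
    {"d_h_id": 2, "d_h_od": None},
]

def missing_field_count(rows, field):
    """Count rows where the given field is absent (None)."""
    return sum(1 for r in rows if r.get(field) is None)

print(missing_field_count(rows, "d_h_od"))
```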

Example 5: run Spark SQL directly with spark-sql on header2, or choose Spark SQL in Hue

    use opay_dw;
    select * from dwd_opay_topup_with_card_record_di where dt='2020-02-05';

  • Original post: https://www.cnblogs.com/hongfeng2019/p/12179315.html