  • emr-spark

    Check the Spark version:
    spark-sql --version


    Open-source Spark package: download from the Apache dist site, e.g.
    spark-2.4.3-bin-hadoop2.8.tgz
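    A sketch of fetching and unpacking a release from the Apache archive; the archive publishes a hadoop2.7 build for 2.4.3, so the exact tarball name (an assumption here) may differ from the one above:

    # fetch and unpack a Spark release; adjust the build name to the tarball you actually need
    wget https://archive.apache.org/dist/spark/spark-2.4.3/spark-2.4.3-bin-hadoop2.7.tgz
    tar -xzf spark-2.4.3-bin-hadoop2.7.tgz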

    1/ For Spark to access S3:
    cp /usr/lib/hadoop-current/share/hadoop/tools/lib/*aws* /usr/lib/spark-current/jars/
    Socket issue:
    wget https://repo1.maven.org/maven2/org/apache/httpcomponents/httpclient/4.3.6/httpclient-4.3.6.jar
    Replace httpclient-4.5.6.jar under /usr/lib/spark-current/jars/ with httpclient-4.3.6.jar and the error goes away.

    # The first runs failed with "socket not created by this factory". Replacing the Spark package didn't help; swapping out httpclient fixed it.
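    A minimal sketch of the swap itself, assuming the paths above (back up the old jar instead of deleting it):

    # replace httpclient 4.5.6 with 4.3.6 in the Spark jars directory
    cd /usr/lib/spark-current/jars/
    sudo mv httpclient-4.5.6.jar /tmp/httpclient-4.5.6.jar.bak
    sudo wget https://repo1.maven.org/maven2/org/apache/httpcomponents/httpclient/4.3.6/httpclient-4.3.6.jar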

    Add the s3a credentials and filesystem implementation to the Hadoop configuration (typically core-site.xml); access.key and secret.key below are placeholders:

    <property>
      <name>fs.s3a.access.key</name>
      <value>access.key</value>
    </property>
    <property>
      <name>fs.s3a.secret.key</name>
      <value>secret.key</value>
    </property>
    <property>
      <name>fs.s3a.impl</name>
      <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
    </property>
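    If editing the cluster config is not an option, the same settings can be passed per job through Spark's spark.hadoop.* conf prefix; a sketch using the same placeholder keys:

    pyspark --queue algo_spark \
      --conf spark.hadoop.fs.s3a.access.key=access.key \
      --conf spark.hadoop.fs.s3a.secret.key=secret.key \
      --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem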

    # Shared by Zhenqian: EMR Hadoop access to S3
    sudo cp /usr/lib/hadoop-current/share/hadoop/tools/lib/*aws* /usr/lib/hadoop-current/share/hadoop/common/lib/
    sudo cp /usr/lib/hadoop-current/share/hadoop/tools/lib/*jackson* /usr/lib/hadoop-current/share/hadoop/common/lib/
    sudo cp /usr/lib/hadoop-current/share/hadoop/tools/lib/joda-time-2.9.4.jar /usr/lib/hadoop-current/share/hadoop/common/lib/
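    After copying the jars, a quick sanity check; s3a://my-bucket/ is a placeholder for a bucket your credentials can reach:

    # list a bucket through the s3a filesystem to confirm the jars and credentials work
    hadoop fs -ls s3a://my-bucket/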


    # Test examples

    Example 1: test S3 access
    pyspark --queue algo_spark

    data = spark.sql("""
    select * from oride_source.order_driver_feature_new where dt="{dt}" and hour="10"
    """.format(dt="2020-01-09"))

    data.take(2)

    Example 2: test S3 access
    su - hdfs
    cd /home/hdfs/bowen.wang/algo-offline-job/direct_dispatch_info
    PYTHONPATH=./ /usr/lib/spark-current/bin/spark-submit --queue algo_spark ./all_city_driver_info.py prod

    Example 3: pyspark with explicit resource parameters (15 executors × 5 cores = 75 executor cores, and 15 × 10G of executor memory in total)

    pyspark --queue algo_spark --num-executors 15 --executor-cores 5 --executor-memory 10G --driver-memory 10G

    Example 4: check whether the JSON field is present

    pyspark --queue data_bi --master yarn-client
    tmp = """ select d_h_id, d_h_od from oride_source.order_driver_feature_new where dt="2020-01-19" and hour="00" """
    data = spark.sql(tmp).rdd
    data.take(2)
    [Row(d_h_id=1, d_h_od=None), Row(d_h_id=2, d_h_od=None)]
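    # d_h_od=None here suggests the JSON field is null/absent for this partition.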

    Example 5: run Spark SQL directly with spark-sql on header2, or choose Spark SQL in Hue

    use opay_dw;
    select * from dwd_opay_topup_with_card_record_di where dt='2020-02-05';
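    The same query can also be run non-interactively via the spark-sql CLI's -e flag; a sketch reusing the queue from the examples above:

    spark-sql --queue algo_spark -e "use opay_dw; select * from dwd_opay_topup_with_card_record_di where dt='2020-02-05';"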

  • Original post: https://www.cnblogs.com/hongfeng2019/p/12179315.html