  • Spark and Hive data inconsistency: Spark defaults to a local metastore, so Spark cannot insert into Hive tables

    Scenario: Spark + Hive is deployed with the client and server separated. On the client, spark-sql, spark-submit, and spark-shell all operate on a local data source, no matter whether Hive is running on the server side. After a week of frustration, a solution finally turned up.

    Reproducing the problem, using spark-submit to submit the job:

    from pyspark import SparkConf, SparkContext
    from pyspark.sql import HiveContext

    conf = SparkConf().setAppName("My app")
    sc = SparkContext(conf=conf)
    hive_context = HiveContext(sc)
    hive_context.sql(''' show tables ''').show()

      +--------+---------+-----------+
      |database|tableName|isTemporary|
      +--------+---------+-----------+
      | default|   camera|      false|
      | default|      src|      false|
      +--------+---------+-----------+

    hive_context.sql(''' select * from camera ''').show()

      +---+-------+---------+
      | id|test_id|camera_id|
      +---+-------+---------+
      +---+-------+---------+

    hive_context.sql(''' insert into table camera values(1,"3","145") ''').show()

      ++
      ||
      ++
      ++

    hive_context.sql(''' select * from camera ''').show()

      +---+-------+---------+
      | id|test_id|camera_id|
      +---+-------+---------+
      +---+-------+---------+

      The same thing happens with spark-sql: the Hive tables are not visible.
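
    A quick way to confirm the client really fell back to a local metastore (a sketch based on Derby's default behavior, not something from the original post): the embedded Derby metastore normally leaves a metastore_db directory and a derby.log file in the directory the job was launched from.

    # Sketch: detect the embedded-Derby fallback. Spark normally creates
    # these artifacts in the launch directory when no remote metastore is
    # configured (default Derby behavior; the exact location can vary).
    import os

    for artifact in ("metastore_db", "derby.log"):
        if os.path.exists(artifact):
            print("found local", artifact, "-- Spark is using an embedded Derby metastore")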

    Solution:

    Add the configuration explicitly when writing the PySpark program; the suspicion is that Spark does not read the hive-site.xml under conf/ when it starts.

    from pyspark.sql import SparkSession

    spark = SparkSession \
        .builder \
        .appName("Python Spark SQL Hive integration example") \
        .config("spark.sql.warehouse.dir", "/usr/hive/warehouse") \
        .config("hive.metastore.uris", "thrift://slave1:9083") \
        .config("fs.defaultFS", "hdfs://master:9000") \
        .enableHiveSupport() \
        .getOrCreate()
    
    # spark is an existing SparkSession
    spark.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING) USING hive")
    
    spark.sql(''' insert into table src values(1,"145") ''').show()
    
    # Queries are expressed in HiveQL
    spark.sql("SELECT * FROM src").show()

      +---+-----+
      |key|value|
      +---+-----+
      |  1|  145|
      +---+-----+

      Finally there is data. Note that the metastore service must first be started in the background on the server: nohup hive --service metastore &
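
    Before submitting the job, it can help to verify that the metastore thrift port is actually reachable from the client. A minimal sketch, using the host and port from this post's configuration:

    # Sketch: check that the Hive metastore thrift service (slave1:9083 in
    # this setup) accepts connections before submitting the Spark job.
    import socket

    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(5)
        try:
            s.connect(("slave1", 9083))
            print("metastore port is reachable")
        except OSError as e:
            print("cannot reach metastore:", e)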

    Finally, here are the hive-site.xml files for the server and the client.

    Server slave1 (/usr/hive/conf/) hive-site.xml

    <configuration>

    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:mysql://master:3306/metastore?createDatabaseIfNotExist=true</value>
      <description>the URL of the MySQL database</description>
    </property>
    
    <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>com.mysql.jdbc.Driver</value>
    </property>
    
    <property>
      <name>javax.jdo.option.ConnectionUserName</name>
      <value>root</value>
    </property>
    
    <property>
      <name>javax.jdo.option.ConnectionPassword</name>
      <value>123456</value>
    </property>
    
    <property>
      <name>datanucleus.autoCreateSchema</name>
      <value>false</value>
    </property>
    
    <property>
      <name>datanucleus.fixedDatastore</name>
      <value>true</value>
    </property>
    
    <property>
      <name>datanucleus.autoStartMechanism</name> 
      <value>SchemaTable</value>
    </property> 
    
    <property>
      <name>hive.metastore.schema.verification</name>
      <value>true</value>
    </property>

    <property>
      <name>hive.metastore.warehouse.dir</name>
      <value>/usr/hive/warehouse</value>
      <description>location of default database for the warehouse</description>
    </property>
    
    </configuration>
    

      Client master (/usr/spark/conf/ and /usr/hive/conf/) hive-site.xml

    <configuration>

    <property>
      <name>hive.metastore.uris</name>
      <value>thrift://slave1:9083</value>
    </property>

    <property>
      <name>hive.metastore.warehouse.dir</name>
      <value>/usr/hive/warehouse</value>
      <description>location of default database for the warehouse</description>
    </property>

    <property>
      <name>hive.exec.scratchdir</name>
      <value>/usr/hive/tmp</value>
    </property>

    <property>
      <name>hive.querylog.location</name>
      <value>/usr/hive/log</value>
    </property>

    </configuration>
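
    Whether Spark picks up these client-side settings depends on a hive-site.xml actually being present in a conf directory it reads. A minimal existence check, using this post's paths (a sketch; adjust for your own layout):

    # Sketch: verify hive-site.xml exists where this post places it.
    import os

    for conf_dir in ("/usr/spark/conf", "/usr/hive/conf"):
        path = os.path.join(conf_dir, "hive-site.xml")
        print(path, "exists" if os.path.exists(path) else "MISSING")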
    

      The configuration files themselves are not the key point, and they may even contain mistakes; what matters is that those three .config() lines must be included when writing the PySpark program.
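
    As an extra sanity check (a sketch, not from the original post), the running session can report whether Hive support actually took effect and which warehouse it points at:

    # Sketch: ask the session built above whether it is Hive-backed.
    # 'spark.sql.catalogImplementation' should read 'hive' when
    # enableHiveSupport() worked (the key is internal in some versions,
    # so a default is supplied).
    print(spark.conf.get("spark.sql.catalogImplementation", "in-memory"))
    print(spark.conf.get("spark.sql.warehouse.dir"))
    print([t.name for t in spark.catalog.listTables("default")])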

    Done!
