  • Spark: the pitfalls of createDataFrame from an RDD

    Scala:

    import org.apache.spark.ml.linalg.Vectors
    import spark.implicits._  // tuple encoder; auto-imported in spark-shell, required in a compiled app
    
    val data = Seq(
      (7, Vectors.dense(0.0, 0.0, 18.0, 1.0), 1.0),
      (8, Vectors.dense(0.0, 1.0, 12.0, 0.0), 0.0),
      (9, Vectors.dense(1.0, 0.0, 15.0, 0.1), 0.0)
    )
    
    val df = spark.createDataset(data).toDF("id", "features", "clicked")
    

    Python:

    from pyspark.ml.linalg import Vectors
    
    df = spark.createDataFrame([
        (7, Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0),
        (8, Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0),
        (9, Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0)
    ], ["id", "features", "clicked"])
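
    With schema inference, the integer id column comes out as long and the ml vector column as the vector UDT. A quick sanity check (the commented output is roughly what this should print):

    df.printSchema()
    # root
    #  |-- id: long (nullable = true)
    #  |-- features: vector (nullable = true)
    #  |-- clicked: double (nullable = true)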
    
    If the data is a pair RDD instead:

        stratified_CV_data = training_data.union(test_data)  # a pair RDD of (label, features)
        # An explicit schema could be passed instead of the column-name list:
        # schema = StructType([
        #     StructField("label", IntegerType(), True),
        #     StructField("features", VectorUDT(), True)])
        vectorized_CV_data = sqlContext.createDataFrame(stratified_CV_data, ["label", "features"])  # or: (stratified_CV_data, schema)
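
    If inference on the raw RDD fails, the commented-out explicit schema can be passed instead. A sketch with the imports that schema needs (IntegerType for the label is taken from the commented code and is an assumption; VectorUDT here comes from pyspark.ml.linalg to match the example above, but if the RDD holds mllib vectors, import it from pyspark.mllib.linalg instead):

        from pyspark.sql.types import StructType, StructField, IntegerType
        from pyspark.ml.linalg import VectorUDT  # the SQL type for ml vectors

        # Explicit schema matching the commented-out sketch above
        schema = StructType([
            StructField("label", IntegerType(), True),
            StructField("features", VectorUDT(), True)])

        vectorized_CV_data = sqlContext.createDataFrame(stratified_CV_data, schema)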

    The reason for all this: Spark's cross-validation only accepts a DataFrame, not an RDD, so the conversion above is unavoidable. Go figure! See the sketch below.
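
    A minimal sketch of that requirement, reusing the df built in the Python example above (the LogisticRegression stage, grid values, and fold count are illustrative assumptions):

    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    # CrossValidator.fit() expects a DataFrame, not an RDD
    lr = LogisticRegression(featuresCol="features", labelCol="clicked")
    grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()
    cv = CrossValidator(estimator=lr,
                        estimatorParamMaps=grid,
                        evaluator=BinaryClassificationEvaluator(labelCol="clicked"),
                        numFolds=2)
    cv_model = cv.fit(df)  # works because df is a DataFrame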

  • Original post: https://www.cnblogs.com/bonelee/p/7805358.html