  • pyspark RandomForestRegressor: random forest regression

    #!/usr/bin/env python3
    # -*- coding: utf-8 -*-
    """
    Created on Fri Jun  8 09:27:08 2018
    
    @author: luogan
    """
    
    from pyspark.ml import Pipeline
    from pyspark.ml.regression import RandomForestRegressor
    from pyspark.ml.feature import VectorIndexer
    from pyspark.ml.evaluation import RegressionEvaluator
    
    from pyspark.sql import SparkSession
    
    # Create (or reuse) the SparkSession; backslashes continue the chained call.
    spark = SparkSession \
        .builder \
        .appName("dataFrame") \
        .getOrCreate()
    
    # Load and parse the data file, converting it to a DataFrame.
    data = spark.read.format("libsvm").load("/home/luogan/lg/softinstall/spark-2.2.0-bin-hadoop2.7/data/mllib/sample_libsvm_data.txt")
    
    # Automatically identify categorical features, and index them.
    # Set maxCategories so features with > 4 distinct values are treated as continuous.
    featureIndexer = \
        VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)
    
    # Split the data into training and test sets (30% held out for testing)
    (trainingData, testData) = data.randomSplit([0.7, 0.3])
    
    # Train a RandomForest model.
    rf = RandomForestRegressor(featuresCol="indexedFeatures")
    
    # Chain indexer and forest in a Pipeline
    pipeline = Pipeline(stages=[featureIndexer, rf])
    
    # Train model.  This also runs the indexer.
    model = pipeline.fit(trainingData)
    
    # Make predictions.
    predictions = model.transform(testData)
    
    # Select example rows to display.
    predictions.select("prediction", "label", "features").show(5)
    
    # Select (prediction, true label) and compute test error
    evaluator = RegressionEvaluator(
        labelCol="label", predictionCol="prediction", metricName="rmse")
    rmse = evaluator.evaluate(predictions)
    print("Root Mean Squared Error (RMSE) on test data = %g" % rmse)
    
    rfModel = model.stages[1]
    print(rfModel)  # summary only
    

     Result:

    +----------+-----+--------------------+
    |prediction|label|            features|
    +----------+-----+--------------------+
    |       0.0|  0.0|(692,[95,96,97,12...|
    |       0.3|  0.0|(692,[100,101,102...|
    |       0.0|  0.0|(692,[123,124,125...|
    |      0.05|  0.0|(692,[124,125,126...|
    |       0.0|  0.0|(692,[124,125,126...|
    +----------+-----+--------------------+
    only showing top 5 rows
    
    Root Mean Squared Error (RMSE) on test data = 0.127949
    RandomForestRegressionModel (uid=RandomForestRegressor_4acc9ab165e4f84f7169) with 20 trees
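
To make the `rmse` metric concrete, the following sketch recomputes it by hand in plain Python for the five rows printed above. It covers only those five displayed rows, not the whole test set, so its value is not expected to match the 0.127949 reported by `RegressionEvaluator`. The helper name `rmse` is an illustrative choice, not part of the PySpark API.

```python
import math

# (prediction, label) pairs copied from the five rows shown in the output above.
rows = [(0.0, 0.0), (0.3, 0.0), (0.0, 0.0), (0.05, 0.0), (0.0, 0.0)]


def rmse(pairs):
    """Root mean squared error: sqrt(mean((prediction - label) ** 2))."""
    squared_errors = [(pred - label) ** 2 for pred, label in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


sample_rmse = rmse(rows)
print("RMSE over the five displayed rows = %g" % sample_rmse)
```

Because the errors are squared before averaging, the single 0.3 prediction accounts for almost all of the error in this small sample.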
    

      

    Original article: https://blog.csdn.net/luoganttcc/article/details/80618336

    For training PySpark classification models, see:

    https://blog.csdn.net/u013719780/article/details/51792097

  • Source address: https://www.cnblogs.com/Allen-rg/p/10046583.html