  • Developing Spark Programs with IDEA

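      I. Distributed Estimation of Pi

      1. How the calculation works

      Suppose a square has side length x, so its area S is x², and its inscribed circle has area C = Pi×(x/2)². The ratio of the circle's area to the square's area, C/S, is therefore Pi/4, which gives Pi = 4×C/S.

      A computer can generate a large number of random points inside the square and use point counts to approximate the areas. If Ps is the number of points in the square and Pc is the number of those points that fall inside the circle, then 4×Pc/Ps approaches Pi as the number of random points tends to infinity.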
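
      The estimate can be tried on a single machine before involving Spark. The following is a minimal sketch in plain Scala; the object name `PiEstimate`, the function `estimatePi`, and the fixed seed are illustrative choices and not part of the original program. It samples points uniformly in the square [-1, 1] × [-1, 1] and counts those that fall inside the unit circle.

```scala
import scala.util.Random

object PiEstimate {
  /** Estimate Pi from n random points in the square [-1, 1] x [-1, 1]. */
  def estimatePi(n: Int, seed: Long): Double = {
    val rng = new Random(seed)
    // Pc: number of points that land inside the inscribed unit circle
    val inside = (1 to n).count { _ =>
      val x = rng.nextDouble() * 2 - 1
      val y = rng.nextDouble() * 2 - 1
      x * x + y * y < 1
    }
    // Ps is n, so Pi is approximately 4 * Pc / Ps
    4.0 * inside / n
  }

  def main(args: Array[String]): Unit =
    println("Pi is roughly " + estimatePi(200000, 42L))
}
```

      With a few hundred thousand points the estimate is typically accurate to about two decimal places; the error shrinks only with the square root of the number of points.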

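      2. Running directly in IDEA

      (1) Start IDEA, Create New Project - Scala - select the JDK and the Scala SDK (Create - Browse - all jars under /home/jun/scala-2.12.6/lib) - Finish

      (2) Right-click src - New - Package - enter com.jun - OK

      (3) File - Project Structure - Libraries - + Java - all jars under /home/jun/spark-2.3.1-bin-hadoop2.7/jars - OK

      (4) Right-click com.jun - New - Scala Class - Name (sparkPi) - Kind (Object) - OK, then enter the following code in the editor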

    package com.jun
    
    import scala.math.random
    import org.apache.spark._
    
    object sparkPi {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("spark Pi") // master comes from -Dspark.master or spark-submit --master
        val spark = new SparkContext(conf)
        val slices = if (args.length > 0) args(0).toInt else 2 // number of partitions
        val n = 100000 * slices                                // total number of random points
        val count = spark.parallelize(1 to n, slices).map { i =>
          val x = random * 2 - 1 // uniform in [-1, 1)
          val y = random * 2 - 1
          if (x*x + y*y < 1) 1 else 0 // 1 if the point lies inside the unit circle
        }.reduce(_ + _)
        println("Pi is roughly " + 4.0 * count / n)
        spark.stop()
      }
    }

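      (5) Run - Edit Configurations - + - Application - enter the run configuration parameters - OK

      (6) Right-click in the code editor - Run sparkPi

      An error appears. It is caused by a version mismatch: according to the Spark website, spark-2.3.1 only supports scala-2.11.x, so Scala has to be switched to a 2.11 release. The Spark jars were compiled against Scala 2.11, and Scala 2.12 is not binary compatible with it, which is why the method lookup below fails at runtime.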

    Exception in thread "main" java.lang.NoSuchMethodError: scala.Predef$.refArrayOps([Ljava/lang/Object;)Lscala/collection/mutable/ArrayOps;
        at org.apache.spark.internal.config.ConfigHelpers$.stringToSeq(ConfigBuilder.scala:48)
        at org.apache.spark.internal.config.TypedConfigBuilder$$anonfun$toSequence$1.apply(ConfigBuilder.scala:124)
        at org.apache.spark.internal.config.TypedConfigBuilder$$anonfun$toSequence$1.apply(ConfigBuilder.scala:124)
        at org.apache.spark.internal.config.TypedConfigBuilder.createWithDefault(ConfigBuilder.scala:142)
        at org.apache.spark.internal.config.package$.<init>(package.scala:152)
        at org.apache.spark.internal.config.package$.<clinit>(package.scala)
        at org.apache.spark.SparkConf$.<init>(SparkConf.scala:668)
        at org.apache.spark.SparkConf$.<clinit>(SparkConf.scala)
        at org.apache.spark.SparkConf.set(SparkConf.scala:94)
        at org.apache.spark.SparkConf$$anonfun$loadFromSystemProperties$3.apply(SparkConf.scala:76)
        at org.apache.spark.SparkConf$$anonfun$loadFromSystemProperties$3.apply(SparkConf.scala:75)
        at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:789)
        at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:231)
        at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:462)
        at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:462)
        at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:788)
        at org.apache.spark.SparkConf.loadFromSystemProperties(SparkConf.scala:75)
        at org.apache.spark.SparkConf.<init>(SparkConf.scala:70)
        at org.apache.spark.SparkConf.<init>(SparkConf.scala:57)
        at com.jun.sparkPi$.main(sparkPi.scala:8)
        at com.jun.sparkPi.main(sparkPi.scala)
    
    Process finished with exit code 1

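      The Spark website's notes for version 2.3.1 contain the statement below, so I switched Scala to 2.11.8. The IDEA version and the Scala plugin version then turned out not to match, so in the end I installed the Scala plugin over the network.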

    Spark runs on Java 8+, Python 2.7+/3.4+ and R 3.1+. For the Scala API, Spark 2.3.1 uses Scala 2.11. You will need to use a compatible Scala version (2.11.x).
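
      One way to confirm which Scala library a project actually runs against is to print it from inside the program. The sketch below uses only `scala.util.Properties` from the standard library; the object name `ScalaVersionCheck` and the helper `isBinaryVersion` are hypothetical names introduced for illustration.

```scala
object ScalaVersionCheck {
  /** True when the Scala library on the classpath belongs to the given binary version, e.g. "2.11". */
  def isBinaryVersion(expected: String): Boolean =
    scala.util.Properties.versionNumberString.startsWith(expected + ".")

  def main(args: Array[String]): Unit = {
    println("Scala library version: " + scala.util.Properties.versionNumberString)
    // Spark 2.3.1 is built for Scala 2.11.x, as quoted above
    if (!isBinaryVersion("2.11"))
      println("Warning: Spark 2.3.1 expects Scala 2.11.x")
  }
}
```

      Running this in the same IDEA module as the Spark program shows the version that the run configuration really uses, which may differ from the SDK that was intended.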

       Then run the program again and look for the result among the pile of log output:

    2018-07-24 11:00:17 INFO  DAGScheduler:54 - ResultStage 0 (reduce at sparkPi.scala:16) finished in 0.779 s
    2018-07-24 11:00:17 INFO  DAGScheduler:54 - Job 0 finished: reduce at sparkPi.scala:16, took 1.286323 s
    Pi is roughly 3.13792
    2018-07-24 11:00:18 INFO  AbstractConnector:318 - Stopped Spark@2c9399a4{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
    2018-07-24 11:00:18 INFO  BlockManagerInfo:54 - Removed broadcast_0_piece0 on master:35290 in memory (size: 1176.0 B, free: 323.7 MB)
    

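      3. Preparation before distributed execution

      Distributed execution here means submitting a jar to the Spark cluster from the client command line, so the program above has to be compiled into a jar first.

      (1) File - Project Structure - Artifacts - + - JAR - From modules with dependencies - set Main Class to com.jun.sparkPi - OK - under Output Layout keep only the compile output entry - OK

      (2) Build - Build Artifacts - Build

      (3) Copy the resulting jar into the Spark installation directory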

    [jun@master bin]$ cp /home/jun/IdeaProjects/sparkAPP/out/artifacts/sparkAPP_jar/sparkAPP.jar /home/jun/spark-2.3.1-bin-hadoop2.7/

      4. Distributed execution

      (1) Local mode

    [jun@master bin]$ /home/jun/spark-2.3.1-bin-hadoop2.7/bin/spark-submit --master local --class com.jun.sparkPi /home/jun/spark-2.3.1-bin-hadoop2.7/sparkAPP.jar 

      The result is printed on the local command line:

    2018-07-24 11:12:21 INFO  TaskSetManager:54 - Finished task 1.0 in stage 0.0 (TID 1) in 34 ms on localhost (executor driver) (2/2)
    2018-07-24 11:12:21 INFO  DAGScheduler:54 - ResultStage 0 (reduce at sparkPi.scala:16) finished in 1.591 s
    2018-07-24 11:12:21 INFO  TaskSchedulerImpl:54 - Removed TaskSet 0.0, whose tasks have all completed, from pool 
    2018-07-24 11:12:21 INFO  DAGScheduler:54 - Job 0 finished: reduce at sparkPi.scala:16, took 1.833831 s
    Pi is roughly 3.14082
    2018-07-24 11:12:21 INFO  AbstractConnector:318 - Stopped Spark@285f09de{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
    2018-07-24 11:12:21 INFO  SparkUI:54 - Stopped Spark web UI at http://master:4040
    2018-07-24 11:12:21 INFO  MapOutputTrackerMasterEndpoint:54 - MapOutputTrackerMasterEndpoint stopped!
    2018-07-24 11:12:21 INFO  MemoryStore:54 - MemoryStore cleared
    2018-07-24 11:12:21 INFO  BlockManager:54 - BlockManager stopped

      (2) Hadoop YARN cluster mode

    [jun@master spark-2.3.1-bin-hadoop2.7]$ bin/spark-submit --master yarn --deploy-mode cluster sparkAPP.jar 

      The command line returns the processing information:

    2018-07-24 11:17:14 INFO  Client:54 - 
         client token: N/A
         diagnostics: N/A
         ApplicationMaster host: 192.168.1.102
         ApplicationMaster RPC port: 0
         queue: default
         start time: 1532402191014
         final status: SUCCEEDED
         tracking URL: http://master:18088/proxy/application_1532394200431_0002/
         user: jun

      The result can be found in stdout, under logs, at the tracking URL

    2018-07-24 11:17:14 INFO  DAGScheduler:54 - ResultStage 0 (reduce at sparkPi.scala:16) finished in 0.910 s
    2018-07-24 11:17:14 INFO  DAGScheduler:54 - Job 0 finished: reduce at sparkPi.scala:16, took 0.970826 s
    Pi is roughly 3.14076
    2018-07-24 11:17:14 INFO  AbstractConnector:318 - Stopped Spark@76017b73{HTTP/1.1,[http/1.1]}{0.0.0.0:0}
    2018-07-24 11:17:14 INFO  SparkUI:54 - Stopped Spark web UI at http://slave1:41837
    2018-07-24 11:17:14 INFO  YarnAllocator:54 - Driver requested a total number of 0 executor(s).

      (3) Hadoop YARN client mode

    [jun@master spark-2.3.1-bin-hadoop2.7]$ bin/spark-submit --master yarn --deploy-mode client sparkAPP.jar 

      The result is shown directly on the local client

    2018-07-24 11:20:21 INFO  TaskSetManager:54 - Finished task 0.0 in stage 0.0 (TID 0) in 3592 ms on slave1 (executor 1) (2/2)
    2018-07-24 11:20:21 INFO  YarnScheduler:54 - Removed TaskSet 0.0, whose tasks have all completed, from pool 
    2018-07-24 11:20:21 INFO  DAGScheduler:54 - ResultStage 0 (reduce at sparkPi.scala:16) finished in 12.041 s
    2018-07-24 11:20:21 INFO  DAGScheduler:54 - Job 0 finished: reduce at sparkPi.scala:16, took 13.017473 s
    Pi is roughly 3.1387
    2018-07-24 11:20:22 INFO  AbstractConnector:318 - Stopped Spark@29a6924f{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
    2018-07-24 11:20:22 INFO  SparkUI:54 - Stopped Spark web UI at http://master:4040
    2018-07-24 11:20:22 INFO  YarnClientSchedulerBackend:54 - Interrupting monitor thread
    2018-07-24 11:20:22 INFO  YarnClientSchedulerBackend:54 - Shutting down all executors
    2018-07-24 11:20:22 INFO  YarnSchedulerBackend$YarnDriverEndpoint:54 - Asking each executor t

       5. Code analysis

      TODO
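
      The structure of sparkPi can be mirrored with ordinary Scala collections. The sketch below is a stand-in under stated assumptions, not Spark code: `grouped` plays the role of `parallelize(1 to n, slices)` by cutting the range into slices, the inner `map` marks each point with 1 or 0, and the two `reduce(_ + _)` calls reflect that a reduce first combines values inside each slice and then merges the per-slice partial results. The names `SliceSketch` and `countInside`, and the per-slice seeding, are hypothetical.

```scala
import scala.util.Random

object SliceSketch {
  /** Reduce inside each slice first, then merge the per-slice partial counts. */
  def countInside(n: Int, slices: Int, seed: Long): Int = {
    require(n > 0 && slices > 0, "n and slices must be positive")
    val sliceSize = math.ceil(n.toDouble / slices).toInt
    val partials = (1 to n).grouped(sliceSize).zipWithIndex.map { case (slice, idx) =>
      val rng = new Random(seed + idx) // separate generator per slice
      slice.map { _ =>
        val x = rng.nextDouble() * 2 - 1
        val y = rng.nextDouble() * 2 - 1
        if (x * x + y * y < 1) 1 else 0 // 1 if the point lies inside the unit circle
      }.reduce(_ + _) // partial count for this slice
    }
    partials.reduce(_ + _) // merge the partial counts
  }

  def main(args: Array[String]): Unit = {
    val slices = 2
    val n = 100000 * slices
    println("Pi is roughly " + 4.0 * countInside(n, slices, 7L) / n)
  }
}
```

      In the real job each slice is processed by a task on an executor and only the small partial counts travel back to the driver, which is why the computation distributes well.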

      II. Loan Risk Prediction with Spark MLlib

      1. How the calculation works

      A CSV file stores a user credit dataset. For example,

    1,1,18,4,2,1049,1,2,4,2,1,4,2,21,3,1,1,3,1,1,1
    1,1,9,4,0,2799,1,3,2,3,1,2,1,36,3,1,2,3,2,1,1
    1,2,12,2,9,841,2,4,2,2,1,4,1,23,3,1,1,2,1,1,1
    1,1,12,4,0,2122,1,3,3,3,1,2,1,39,3,1,2,2,2,1,2
    1,1,12,4,0,2171,1,3,4,3,1,4,2,38,1,2,2,2,1,1,2
    1,1,10,4,0,2241,1,2,1,3,1,3,1,48,3,1,2,2,2,1,2

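      In the user credit dataset, each sample is labeled with one of two classes, 1 (creditworthy) or 0 (not creditworthy). Each sample has 21 fields: the first field, 1 or 0, says whether the user is creditworthy, and the other 20 are feature fields, in order: deposit balance, duration, history, purpose, amount, savings, employment status, installment amount, marital status, guarantors, length of residence, assets, age, past credit, apartment, loans, occupation, dependents, telephone ownership, and foreign status.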
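
      The field layout can be checked on a single line before any Spark code is involved. The following minimal sketch, in plain Scala, splits one line into its label and its 20 features; `CreditLine` and `parse` are illustrative names, and the input is the first sample line shown above.

```scala
object CreditLine {
  /** Split one CSV line into (label, features): field 0 is the label, fields 1 to 20 are features. */
  def parse(line: String): (Double, Array[Double]) = {
    val fields = line.split(",").map(_.trim.toDouble)
    require(fields.length == 21, "expected 21 fields, got " + fields.length)
    (fields.head, fields.tail)
  }

  def main(args: Array[String]): Unit = {
    val (label, features) = parse("1,1,18,4,2,1049,1,2,4,2,1,4,2,21,3,1,1,3,1,1,1")
    println("label = " + label + ", feature count = " + features.length)
  }
}
```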

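      A random forest model, which is an ensemble of decision trees, is used to classify the risk of bank credit loans.

      2. Running the program

      (1) In IDEA, create a new Scala project, package, and class, configure the Project SDK and Scala SDK, copy the csv file into the new project, and paste the following code into the editor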

    package com.jun
    
    import org.apache.spark._
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types._
    import org.apache.spark.sql._
    import org.apache.spark.ml.classification.RandomForestClassifier
    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
    import org.apache.spark.ml.feature.StringIndexer
    import org.apache.spark.ml.feature.VectorAssembler
    import org.apache.spark.ml.tuning.{ ParamGridBuilder, CrossValidator }
    import org.apache.spark.ml.{ Pipeline, PipelineStage }
    import org.apache.spark.mllib.evaluation.RegressionMetrics
    
    object Credit {
    
      case class Credit(
                         creditability: Double,
                         balance: Double, duration: Double, history: Double, purpose: Double, amount: Double,
                         savings: Double, employment: Double, instPercent: Double, sexMarried: Double, guarantors: Double,
                         residenceDuration: Double, assets: Double, age: Double, concCredit: Double, apartment: Double,
                         credits: Double, occupation: Double, dependents: Double, hasPhone: Double, foreign: Double
                       )
    
      def parseCredit(line: Array[Double]): Credit = {
        Credit(
          line(0),
          line(1) - 1, line(2), line(3), line(4), line(5),
          line(6) - 1, line(7) - 1, line(8), line(9) - 1, line(10) - 1,
          line(11) - 1, line(12) - 1, line(13), line(14) - 1, line(15) - 1,
          line(16) - 1, line(17) - 1, line(18) - 1, line(19) - 1, line(20) - 1
        )
      }
    
      def parseRDD(rdd: RDD[String]): RDD[Array[Double]] = {
        rdd.map(_.split(",")).map(_.map(_.toDouble))
      }
    
      def main(args: Array[String]): Unit = {
    
        val conf = new SparkConf().setAppName("SparkDFebay")
        val sc = new SparkContext(conf)
        val sqlContext = new SQLContext(sc)
        import sqlContext._
        import sqlContext.implicits._
    
        val creditDF = parseRDD(sc.textFile("germancredit.csv")).map(parseCredit).toDF().cache()
        creditDF.registerTempTable("credit")
        creditDF.printSchema
    
        creditDF.show
    
        sqlContext.sql("SELECT creditability, avg(balance) as avgbalance, avg(amount) as avgamt, avg(duration) as avgdur  FROM credit GROUP BY creditability ").show
    
        creditDF.describe("balance").show
        creditDF.groupBy("creditability").avg("balance").show
    
        val featureCols = Array("balance", "duration", "history", "purpose", "amount",
          "savings", "employment", "instPercent", "sexMarried", "guarantors",
          "residenceDuration", "assets", "age", "concCredit", "apartment",
          "credits", "occupation", "dependents", "hasPhone", "foreign")
        val assembler = new VectorAssembler().setInputCols(featureCols).setOutputCol("features")
        val df2 = assembler.transform(creditDF)
        df2.show
    
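        // StringIndexer gives index 0.0 to the most frequent value, so creditability 1.0 maps to label 0.0 here
        val labelIndexer = new StringIndexer().setInputCol("creditability").setOutputCol("label")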
        val df3 = labelIndexer.fit(df2).transform(df2)
        df3.show
        val splitSeed = 5043
        val Array(trainingData, testData) = df3.randomSplit(Array(0.7, 0.3), splitSeed)
    
        val classifier = new RandomForestClassifier().setImpurity("gini").setMaxDepth(3).setNumTrees(20).setFeatureSubsetStrategy("auto").setSeed(5043)
        val model = classifier.fit(trainingData)
    
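        // BinaryClassificationEvaluator defaults to the areaUnderROC metric
        val evaluator = new BinaryClassificationEvaluator().setLabelCol("label")
        val predictions = model.transform(testData)
        model.toDebugString // result is discarded; wrap it in println to see the trees
    
        val accuracy = evaluator.evaluate(predictions) // area under the ROC curve, despite the name
        println("accuracy before pipeline fitting" + accuracy)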
    
        val rm = new RegressionMetrics(
          predictions.select("prediction", "label").rdd.map(x =>
            (x(0).asInstanceOf[Double], x(1).asInstanceOf[Double]))
        )
        println("MSE: " + rm.meanSquaredError)
        println("MAE: " + rm.meanAbsoluteError)
        println("RMSE Squared: " + rm.rootMeanSquaredError) // this value is the RMSE itself, not its square
        println("R Squared: " + rm.r2)
        println("Explained Variance: " + rm.explainedVariance + "\n")
    
        val paramGrid = new ParamGridBuilder()
          .addGrid(classifier.maxBins, Array(25, 31))
          .addGrid(classifier.maxDepth, Array(5, 10))
          .addGrid(classifier.numTrees, Array(20, 60))
          .addGrid(classifier.impurity, Array("entropy", "gini"))
          .build()
    
        val steps: Array[PipelineStage] = Array(classifier)
        val pipeline = new Pipeline().setStages(steps)
    
        val cv = new CrossValidator()
          .setEstimator(pipeline)
          .setEvaluator(evaluator)
          .setEstimatorParamMaps(paramGrid)
          .setNumFolds(10)
    
        val pipelineFittedModel = cv.fit(trainingData)
    
        val predictions2 = pipelineFittedModel.transform(testData)
        val accuracy2 = evaluator.evaluate(predictions2)
        println("accuracy after pipeline fitting" + accuracy2)
    
        println(pipelineFittedModel.bestModel.asInstanceOf[org.apache.spark.ml.PipelineModel].stages(0))
    
        pipelineFittedModel
          .bestModel.asInstanceOf[org.apache.spark.ml.PipelineModel]
          .stages(0)
          .extractParamMap
    
        val rm2 = new RegressionMetrics(
          predictions2.select("prediction", "label").rdd.map(x =>
            (x(0).asInstanceOf[Double], x(1).asInstanceOf[Double]))
        )
    
        println("MSE: " + rm2.meanSquaredError)
        println("MAE: " + rm2.meanAbsoluteError)
        println("RMSE Squared: " + rm2.rootMeanSquaredError) // this value is the RMSE itself, not its square
        println("R Squared: " + rm2.r2)
        println("Explained Variance: " + rm2.explainedVariance + "\n")
    
      }
    }

      (2) Edit the run configuration: Edit Configurations - Application - Name (Credit), Main Class (com.jun.Credit), Program arguments (/home/jun/IdeaProjects/Credit), VM options (-Dspark.master=local -Dspark.app.name=Credit -server -XX:PermSize=128M -XX:MaxPermSize=256M)

      (3)Run Credit

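      (4) The console output is flooded with INFO log lines, so the results are hard to find. To hide the INFO logs, copy the default log configuration template from the Spark installation's conf directory into the project's src directory and change the log level that is shown on the console.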

    [jun@master conf]$ cp /home/jun/spark-2.3.1-bin-hadoop2.7/conf/log4j.properties.template /home/jun/IdeaProjects/Credit/src/
    [jun@master conf]$ cd /home/jun/IdeaProjects/Credit/src/
    [jun@master src]$ mv log4j.properties.template log4j.properties
    [jun@master src]$ gedit log4j.properties 

      In the log configuration file, change the log level so that only ERROR-level logs are printed to the console

    log4j.rootCategory=ERROR, console

      Run the program again; the final output is:

    Java HotSpot(TM) 64-Bit Server VM warning: ignoring option PermSize=128M; support was removed in 8.0
    Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=256M; support was removed in 8.0
    root
     |-- creditability: double (nullable = false)
     |-- balance: double (nullable = false)
     |-- duration: double (nullable = false)
     |-- history: double (nullable = false)
     |-- purpose: double (nullable = false)
     |-- amount: double (nullable = false)
     |-- savings: double (nullable = false)
     |-- employment: double (nullable = false)
     |-- instPercent: double (nullable = false)
     |-- sexMarried: double (nullable = false)
     |-- guarantors: double (nullable = false)
     |-- residenceDuration: double (nullable = false)
     |-- assets: double (nullable = false)
     |-- age: double (nullable = false)
     |-- concCredit: double (nullable = false)
     |-- apartment: double (nullable = false)
     |-- credits: double (nullable = false)
     |-- occupation: double (nullable = false)
     |-- dependents: double (nullable = false)
     |-- hasPhone: double (nullable = false)
     |-- foreign: double (nullable = false)
    
    +-------------+-------+--------+-------+-------+------+-------+----------+-----------+----------+----------+-----------------+------+----+----------+---------+-------+----------+----------+--------+-------+
    |creditability|balance|duration|history|purpose|amount|savings|employment|instPercent|sexMarried|guarantors|residenceDuration|assets| age|concCredit|apartment|credits|occupation|dependents|hasPhone|foreign|
    +-------------+-------+--------+-------+-------+------+-------+----------+-----------+----------+----------+-----------------+------+----+----------+---------+-------+----------+----------+--------+-------+
    |          1.0|    0.0|    18.0|    4.0|    2.0|1049.0|    0.0|       1.0|        4.0|       1.0|       0.0|              3.0|   1.0|21.0|       2.0|      0.0|    0.0|       2.0|       0.0|     0.0|    0.0|
    |          1.0|    0.0|     9.0|    4.0|    0.0|2799.0|    0.0|       2.0|        2.0|       2.0|       0.0|              1.0|   0.0|36.0|       2.0|      0.0|    1.0|       2.0|       1.0|     0.0|    0.0|
    |          1.0|    1.0|    12.0|    2.0|    9.0| 841.0|    1.0|       3.0|        2.0|       1.0|       0.0|              3.0|   0.0|23.0|       2.0|      0.0|    0.0|       1.0|       0.0|     0.0|    0.0|
    |          1.0|    0.0|    12.0|    4.0|    0.0|2122.0|    0.0|       2.0|        3.0|       2.0|       0.0|              1.0|   0.0|39.0|       2.0|      0.0|    1.0|       1.0|       1.0|     0.0|    1.0|
    |          1.0|    0.0|    12.0|    4.0|    0.0|2171.0|    0.0|       2.0|        4.0|       2.0|       0.0|              3.0|   1.0|38.0|       0.0|      1.0|    1.0|       1.0|       0.0|     0.0|    1.0|
    |          1.0|    0.0|    10.0|    4.0|    0.0|2241.0|    0.0|       1.0|        1.0|       2.0|       0.0|              2.0|   0.0|48.0|       2.0|      0.0|    1.0|       1.0|       1.0|     0.0|    1.0|
    |          1.0|    0.0|     8.0|    4.0|    0.0|3398.0|    0.0|       3.0|        1.0|       2.0|       0.0|              3.0|   0.0|39.0|       2.0|      1.0|    1.0|       1.0|       0.0|     0.0|    1.0|
    |          1.0|    0.0|     6.0|    4.0|    0.0|1361.0|    0.0|       1.0|        2.0|       2.0|       0.0|              3.0|   0.0|40.0|       2.0|      1.0|    0.0|       1.0|       1.0|     0.0|    1.0|
    |          1.0|    3.0|    18.0|    4.0|    3.0|1098.0|    0.0|       0.0|        4.0|       1.0|       0.0|              3.0|   2.0|65.0|       2.0|      1.0|    1.0|       0.0|       0.0|     0.0|    0.0|
    |          1.0|    1.0|    24.0|    2.0|    3.0|3758.0|    2.0|       0.0|        1.0|       1.0|       0.0|              3.0|   3.0|23.0|       2.0|      0.0|    0.0|       0.0|       0.0|     0.0|    0.0|
    |          1.0|    0.0|    11.0|    4.0|    0.0|3905.0|    0.0|       2.0|        2.0|       2.0|       0.0|              1.0|   0.0|36.0|       2.0|      0.0|    1.0|       2.0|       1.0|     0.0|    0.0|
    |          1.0|    0.0|    30.0|    4.0|    1.0|6187.0|    1.0|       3.0|        1.0|       3.0|       0.0|              3.0|   2.0|24.0|       2.0|      0.0|    1.0|       2.0|       0.0|     0.0|    0.0|
    |          1.0|    0.0|     6.0|    4.0|    3.0|1957.0|    0.0|       3.0|        1.0|       1.0|       0.0|              3.0|   2.0|31.0|       2.0|      1.0|    0.0|       2.0|       0.0|     0.0|    0.0|
    |          1.0|    1.0|    48.0|    3.0|   10.0|7582.0|    1.0|       0.0|        2.0|       2.0|       0.0|              3.0|   3.0|31.0|       2.0|      1.0|    0.0|       3.0|       0.0|     1.0|    0.0|
    |          1.0|    0.0|    18.0|    2.0|    3.0|1936.0|    4.0|       3.0|        2.0|       3.0|       0.0|              3.0|   2.0|23.0|       2.0|      0.0|    1.0|       1.0|       0.0|     0.0|    0.0|
    |          1.0|    0.0|     6.0|    2.0|    3.0|2647.0|    2.0|       2.0|        2.0|       2.0|       0.0|              2.0|   0.0|44.0|       2.0|      0.0|    0.0|       2.0|       1.0|     0.0|    0.0|
    |          1.0|    0.0|    11.0|    4.0|    0.0|3939.0|    0.0|       2.0|        1.0|       2.0|       0.0|              1.0|   0.0|40.0|       2.0|      1.0|    1.0|       1.0|       1.0|     0.0|    0.0|
    |          1.0|    1.0|    18.0|    2.0|    3.0|3213.0|    2.0|       1.0|        1.0|       3.0|       0.0|              2.0|   0.0|25.0|       2.0|      0.0|    0.0|       2.0|       0.0|     0.0|    0.0|
    |          1.0|    1.0|    36.0|    4.0|    3.0|2337.0|    0.0|       4.0|        4.0|       2.0|       0.0|              3.0|   0.0|36.0|       2.0|      1.0|    0.0|       2.0|       0.0|     0.0|    0.0|
    |          1.0|    3.0|    11.0|    4.0|    0.0|7228.0|    0.0|       2.0|        1.0|       2.0|       0.0|              3.0|   1.0|39.0|       2.0|      1.0|    1.0|       1.0|       0.0|     0.0|    0.0|
    +-------------+-------+--------+-------+-------+------+-------+----------+-----------+----------+----------+-----------------+------+----+----------+---------+-------+----------+----------+--------+-------+
    only showing top 20 rows
    
    +-------------+------------------+------------------+------------------+
    |creditability|        avgbalance|            avgamt|            avgdur|
    +-------------+------------------+------------------+------------------+
    |          0.0|0.9033333333333333|3938.1266666666666|             24.86|
    |          1.0|1.8657142857142857| 2985.442857142857|19.207142857142856|
    +-------------+------------------+------------------+------------------+
    
    +-------+------------------+
    |summary|           balance|
    +-------+------------------+
    |  count|              1000|
    |   mean|             1.577|
    | stddev|1.2576377271108938|
    |    min|               0.0|
    |    max|               3.0|
    +-------+------------------+
    
    +-------------+------------------+
    |creditability|      avg(balance)|
    +-------------+------------------+
    |          0.0|0.9033333333333333|
    |          1.0|1.8657142857142857|
    +-------------+------------------+
    
    +-------------+-------+--------+-------+-------+------+-------+----------+-----------+----------+----------+-----------------+------+----+----------+---------+-------+----------+----------+--------+-------+--------------------+
    |creditability|balance|duration|history|purpose|amount|savings|employment|instPercent|sexMarried|guarantors|residenceDuration|assets| age|concCredit|apartment|credits|occupation|dependents|hasPhone|foreign|            features|
    +-------------+-------+--------+-------+-------+------+-------+----------+-----------+----------+----------+-----------------+------+----+----------+---------+-------+----------+----------+--------+-------+--------------------+
    |          1.0|    0.0|    18.0|    4.0|    2.0|1049.0|    0.0|       1.0|        4.0|       1.0|       0.0|              3.0|   1.0|21.0|       2.0|      0.0|    0.0|       2.0|       0.0|     0.0|    0.0|(20,[1,2,3,4,6,7,...|
    |          1.0|    0.0|     9.0|    4.0|    0.0|2799.0|    0.0|       2.0|        2.0|       2.0|       0.0|              1.0|   0.0|36.0|       2.0|      0.0|    1.0|       2.0|       1.0|     0.0|    0.0|(20,[1,2,4,6,7,8,...|
    |          1.0|    1.0|    12.0|    2.0|    9.0| 841.0|    1.0|       3.0|        2.0|       1.0|       0.0|              3.0|   0.0|23.0|       2.0|      0.0|    0.0|       1.0|       0.0|     0.0|    0.0|[1.0,12.0,2.0,9.0...|
    |          1.0|    0.0|    12.0|    4.0|    0.0|2122.0|    0.0|       2.0|        3.0|       2.0|       0.0|              1.0|   0.0|39.0|       2.0|      0.0|    1.0|       1.0|       1.0|     0.0|    1.0|[0.0,12.0,4.0,0.0...|
    |          1.0|    0.0|    12.0|    4.0|    0.0|2171.0|    0.0|       2.0|        4.0|       2.0|       0.0|              3.0|   1.0|38.0|       0.0|      1.0|    1.0|       1.0|       0.0|     0.0|    1.0|[0.0,12.0,4.0,0.0...|
    |          1.0|    0.0|    10.0|    4.0|    0.0|2241.0|    0.0|       1.0|        1.0|       2.0|       0.0|              2.0|   0.0|48.0|       2.0|      0.0|    1.0|       1.0|       1.0|     0.0|    1.0|[0.0,10.0,4.0,0.0...|
    |          1.0|    0.0|     8.0|    4.0|    0.0|3398.0|    0.0|       3.0|        1.0|       2.0|       0.0|              3.0|   0.0|39.0|       2.0|      1.0|    1.0|       1.0|       0.0|     0.0|    1.0|[0.0,8.0,4.0,0.0,...|
    |          1.0|    0.0|     6.0|    4.0|    0.0|1361.0|    0.0|       1.0|        2.0|       2.0|       0.0|              3.0|   0.0|40.0|       2.0|      1.0|    0.0|       1.0|       1.0|     0.0|    1.0|[0.0,6.0,4.0,0.0,...|
    |          1.0|    3.0|    18.0|    4.0|    3.0|1098.0|    0.0|       0.0|        4.0|       1.0|       0.0|              3.0|   2.0|65.0|       2.0|      1.0|    1.0|       0.0|       0.0|     0.0|    0.0|[3.0,18.0,4.0,3.0...|
    |          1.0|    1.0|    24.0|    2.0|    3.0|3758.0|    2.0|       0.0|        1.0|       1.0|       0.0|              3.0|   3.0|23.0|       2.0|      0.0|    0.0|       0.0|       0.0|     0.0|    0.0|(20,[0,1,2,3,4,5,...|
    |          1.0|    0.0|    11.0|    4.0|    0.0|3905.0|    0.0|       2.0|        2.0|       2.0|       0.0|              1.0|   0.0|36.0|       2.0|      0.0|    1.0|       2.0|       1.0|     0.0|    0.0|(20,[1,2,4,6,7,8,...|
    |          1.0|    0.0|    30.0|    4.0|    1.0|6187.0|    1.0|       3.0|        1.0|       3.0|       0.0|              3.0|   2.0|24.0|       2.0|      0.0|    1.0|       2.0|       0.0|     0.0|    0.0|[0.0,30.0,4.0,1.0...|
    |          1.0|    0.0|     6.0|    4.0|    3.0|1957.0|    0.0|       3.0|        1.0|       1.0|       0.0|              3.0|   2.0|31.0|       2.0|      1.0|    0.0|       2.0|       0.0|     0.0|    0.0|[0.0,6.0,4.0,3.0,...|
    |          1.0|    1.0|    48.0|    3.0|   10.0|7582.0|    1.0|       0.0|        2.0|       2.0|       0.0|              3.0|   3.0|31.0|       2.0|      1.0|    0.0|       3.0|       0.0|     1.0|    0.0|[1.0,48.0,3.0,10....|
    |          1.0|    0.0|    18.0|    2.0|    3.0|1936.0|    4.0|       3.0|        2.0|       3.0|       0.0|              3.0|   2.0|23.0|       2.0|      0.0|    1.0|       1.0|       0.0|     0.0|    0.0|[0.0,18.0,2.0,3.0...|
    |          1.0|    0.0|     6.0|    2.0|    3.0|2647.0|    2.0|       2.0|        2.0|       2.0|       0.0|              2.0|   0.0|44.0|       2.0|      0.0|    0.0|       2.0|       1.0|     0.0|    0.0|[0.0,6.0,2.0,3.0,...|
    |          1.0|    0.0|    11.0|    4.0|    0.0|3939.0|    0.0|       2.0|        1.0|       2.0|       0.0|              1.0|   0.0|40.0|       2.0|      1.0|    1.0|       1.0|       1.0|     0.0|    0.0|[0.0,11.0,4.0,0.0...|
    |          1.0|    1.0|    18.0|    2.0|    3.0|3213.0|    2.0|       1.0|        1.0|       3.0|       0.0|              2.0|   0.0|25.0|       2.0|      0.0|    0.0|       2.0|       0.0|     0.0|    0.0|[1.0,18.0,2.0,3.0...|
    |          1.0|    1.0|    36.0|    4.0|    3.0|2337.0|    0.0|       4.0|        4.0|       2.0|       0.0|              3.0|   0.0|36.0|       2.0|      1.0|    0.0|       2.0|       0.0|     0.0|    0.0|[1.0,36.0,4.0,3.0...|
    |          1.0|    3.0|    11.0|    4.0|    0.0|7228.0|    0.0|       2.0|        1.0|       2.0|       0.0|              3.0|   1.0|39.0|       2.0|      1.0|    1.0|       1.0|       0.0|     0.0|    0.0|[3.0,11.0,4.0,0.0...|
    +-------------+-------+--------+-------+-------+------+-------+----------+-----------+----------+----------+-----------------+------+----+----------+---------+-------+----------+----------+--------+-------+--------------------+
    only showing top 20 rows
    
    +-------------+-------+--------+-------+-------+------+-------+----------+-----------+----------+----------+-----------------+------+----+----------+---------+-------+----------+----------+--------+-------+--------------------+-----+
    |creditability|balance|duration|history|purpose|amount|savings|employment|instPercent|sexMarried|guarantors|residenceDuration|assets| age|concCredit|apartment|credits|occupation|dependents|hasPhone|foreign|            features|label|
    +-------------+-------+--------+-------+-------+------+-------+----------+-----------+----------+----------+-----------------+------+----+----------+---------+-------+----------+----------+--------+-------+--------------------+-----+
    |          1.0|    0.0|    18.0|    4.0|    2.0|1049.0|    0.0|       1.0|        4.0|       1.0|       0.0|              3.0|   1.0|21.0|       2.0|      0.0|    0.0|       2.0|       0.0|     0.0|    0.0|(20,[1,2,3,4,6,7,...|  0.0|
    |          1.0|    0.0|     9.0|    4.0|    0.0|2799.0|    0.0|       2.0|        2.0|       2.0|       0.0|              1.0|   0.0|36.0|       2.0|      0.0|    1.0|       2.0|       1.0|     0.0|    0.0|(20,[1,2,4,6,7,8,...|  0.0|
    |          1.0|    1.0|    12.0|    2.0|    9.0| 841.0|    1.0|       3.0|        2.0|       1.0|       0.0|              3.0|   0.0|23.0|       2.0|      0.0|    0.0|       1.0|       0.0|     0.0|    0.0|[1.0,12.0,2.0,9.0...|  0.0|
    |          1.0|    0.0|    12.0|    4.0|    0.0|2122.0|    0.0|       2.0|        3.0|       2.0|       0.0|              1.0|   0.0|39.0|       2.0|      0.0|    1.0|       1.0|       1.0|     0.0|    1.0|[0.0,12.0,4.0,0.0...|  0.0|
    |          1.0|    0.0|    12.0|    4.0|    0.0|2171.0|    0.0|       2.0|        4.0|       2.0|       0.0|              3.0|   1.0|38.0|       0.0|      1.0|    1.0|       1.0|       0.0|     0.0|    1.0|[0.0,12.0,4.0,0.0...|  0.0|
    |          1.0|    0.0|    10.0|    4.0|    0.0|2241.0|    0.0|       1.0|        1.0|       2.0|       0.0|              2.0|   0.0|48.0|       2.0|      0.0|    1.0|       1.0|       1.0|     0.0|    1.0|[0.0,10.0,4.0,0.0...|  0.0|
    |          1.0|    0.0|     8.0|    4.0|    0.0|3398.0|    0.0|       3.0|        1.0|       2.0|       0.0|              3.0|   0.0|39.0|       2.0|      1.0|    1.0|       1.0|       0.0|     0.0|    1.0|[0.0,8.0,4.0,0.0,...|  0.0|
    |          1.0|    0.0|     6.0|    4.0|    0.0|1361.0|    0.0|       1.0|        2.0|       2.0|       0.0|              3.0|   0.0|40.0|       2.0|      1.0|    0.0|       1.0|       1.0|     0.0|    1.0|[0.0,6.0,4.0,0.0,...|  0.0|
    |          1.0|    3.0|    18.0|    4.0|    3.0|1098.0|    0.0|       0.0|        4.0|       1.0|       0.0|              3.0|   2.0|65.0|       2.0|      1.0|    1.0|       0.0|       0.0|     0.0|    0.0|[3.0,18.0,4.0,3.0...|  0.0|
    |          1.0|    1.0|    24.0|    2.0|    3.0|3758.0|    2.0|       0.0|        1.0|       1.0|       0.0|              3.0|   3.0|23.0|       2.0|      0.0|    0.0|       0.0|       0.0|     0.0|    0.0|(20,[0,1,2,3,4,5,...|  0.0|
    |          1.0|    0.0|    11.0|    4.0|    0.0|3905.0|    0.0|       2.0|        2.0|       2.0|       0.0|              1.0|   0.0|36.0|       2.0|      0.0|    1.0|       2.0|       1.0|     0.0|    0.0|(20,[1,2,4,6,7,8,...|  0.0|
    |          1.0|    0.0|    30.0|    4.0|    1.0|6187.0|    1.0|       3.0|        1.0|       3.0|       0.0|              3.0|   2.0|24.0|       2.0|      0.0|    1.0|       2.0|       0.0|     0.0|    0.0|[0.0,30.0,4.0,1.0...|  0.0|
    |          1.0|    0.0|     6.0|    4.0|    3.0|1957.0|    0.0|       3.0|        1.0|       1.0|       0.0|              3.0|   2.0|31.0|       2.0|      1.0|    0.0|       2.0|       0.0|     0.0|    0.0|[0.0,6.0,4.0,3.0,...|  0.0|
    |          1.0|    1.0|    48.0|    3.0|   10.0|7582.0|    1.0|       0.0|        2.0|       2.0|       0.0|              3.0|   3.0|31.0|       2.0|      1.0|    0.0|       3.0|       0.0|     1.0|    0.0|[1.0,48.0,3.0,10....|  0.0|
    |          1.0|    0.0|    18.0|    2.0|    3.0|1936.0|    4.0|       3.0|        2.0|       3.0|       0.0|              3.0|   2.0|23.0|       2.0|      0.0|    1.0|       1.0|       0.0|     0.0|    0.0|[0.0,18.0,2.0,3.0...|  0.0|
    |          1.0|    0.0|     6.0|    2.0|    3.0|2647.0|    2.0|       2.0|        2.0|       2.0|       0.0|              2.0|   0.0|44.0|       2.0|      0.0|    0.0|       2.0|       1.0|     0.0|    0.0|[0.0,6.0,2.0,3.0,...|  0.0|
    |          1.0|    0.0|    11.0|    4.0|    0.0|3939.0|    0.0|       2.0|        1.0|       2.0|       0.0|              1.0|   0.0|40.0|       2.0|      1.0|    1.0|       1.0|       1.0|     0.0|    0.0|[0.0,11.0,4.0,0.0...|  0.0|
    |          1.0|    1.0|    18.0|    2.0|    3.0|3213.0|    2.0|       1.0|        1.0|       3.0|       0.0|              2.0|   0.0|25.0|       2.0|      0.0|    0.0|       2.0|       0.0|     0.0|    0.0|[1.0,18.0,2.0,3.0...|  0.0|
    |          1.0|    1.0|    36.0|    4.0|    3.0|2337.0|    0.0|       4.0|        4.0|       2.0|       0.0|              3.0|   0.0|36.0|       2.0|      1.0|    0.0|       2.0|       0.0|     0.0|    0.0|[1.0,36.0,4.0,3.0...|  0.0|
    |          1.0|    3.0|    11.0|    4.0|    0.0|7228.0|    0.0|       2.0|        1.0|       2.0|       0.0|              3.0|   1.0|39.0|       2.0|      1.0|    1.0|       1.0|       0.0|     0.0|    0.0|[3.0,11.0,4.0,0.0...|  0.0|
    +-------------+-------+--------+-------+-------+------+-------+----------+-----------+----------+----------+-----------------+------+----+----------+---------+-------+----------+----------+--------+-------+--------------------+-----+
    only showing top 20 rows
    
    accuracy before pipeline fitting0.7264394897138242
    MSE: 0.22442244224422442
    MAE: 0.22442244224422442
    RMSE Squared: 0.47373245850820106
    R Squared: -0.1840018388690956
    Explained Variance: 0.09866135128364424
    
    accuracy after pipeline fitting0.7523847833582331
    RandomForestClassificationModel (uid=rfc_3146cd3eaaac) with 60 trees
    MSE: 0.23762376237623759
    MAE: 0.2376237623762376
    RMSE Squared: 0.48746667822143247
    R Squared: -0.25364900586139494
    Explained Variance: 0.15708699582829524
    
    
    Process finished with exit code 0

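      Comparing "accuracy before pipeline fitting0.7264394897138242" with "accuracy after pipeline fitting0.7523847833582331" shows that the best model selected by cross-validated pipeline training can be used for prediction and scores higher on the test data than the initial model: 0.7524 versus 0.7264. These figures come from BinaryClassificationEvaluator, whose default metric is the area under the ROC curve, so they are AUC values rather than classification accuracy, despite the label in the printed output.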
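
      Classification accuracy can be recovered from the regression metrics in the output. Because `prediction` and `label` only take the values 0.0 and 1.0, each squared error is 1 for a misclassified row and 0 otherwise, so the MSE (and the MAE) equals the misclassification rate and accuracy is 1 - MSE. The sketch below checks this identity on a small made-up sample and then applies it to the two MSE values from the run; `AccuracyFromMse` and its methods are illustrative names.

```scala
object AccuracyFromMse {
  /** Mean squared error over (prediction, label) pairs. */
  def mse(pairs: Seq[(Double, Double)]): Double =
    pairs.map { case (p, l) => (p - l) * (p - l) }.sum / pairs.size

  /** Fraction of pairs where the prediction equals the label. */
  def accuracy(pairs: Seq[(Double, Double)]): Double =
    pairs.count { case (p, l) => p == l }.toDouble / pairs.size

  def main(args: Array[String]): Unit = {
    // Made-up 0/1 sample: one of four rows is misclassified
    val sample = Seq((0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (0.0, 0.0))
    println("sample MSE = " + mse(sample) + ", accuracy = " + accuracy(sample))

    // MSE values copied from the run output above
    println("accuracy before tuning = " + (1 - 0.22442244224422442))
    println("accuracy after tuning  = " + (1 - 0.23762376237623759))
  }
}
```

      For this run the identity gives an accuracy of roughly 0.776 before tuning and 0.762 after, so on this particular test split the tuned model ranks samples better (higher area under the ROC curve) while classifying slightly fewer of them correctly.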

      3. Code analysis

      TODO
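
      One point worth quantifying is how much work the cross-validation step does. The sketch below, in plain Scala, enumerates the Cartesian product of the candidate values passed to ParamGridBuilder in Credit and multiplies by the number of folds; `GridSize` and `combinations` are hypothetical helper names, and the grid is copied by hand from the program rather than read from Spark.

```scala
object GridSize {
  /** Cartesian product of the candidate values for each parameter. */
  def combinations(grid: Seq[(String, Seq[Any])]): Seq[Map[String, Any]] =
    grid.foldLeft(Seq(Map.empty[String, Any])) { case (acc, (name, values)) =>
      for (m <- acc; v <- values) yield m + (name -> v)
    }

  // Candidate values copied from the ParamGridBuilder calls in Credit
  val creditGrid: Seq[(String, Seq[Any])] = Seq(
    ("maxBins", Seq(25, 31)),
    ("maxDepth", Seq(5, 10)),
    ("numTrees", Seq(20, 60)),
    ("impurity", Seq("entropy", "gini"))
  )

  def main(args: Array[String]): Unit = {
    val combos = combinations(creditGrid)
    val folds = 10 // matches setNumFolds(10)
    println(combos.size + " parameter combinations")
    println((combos.size * folds) + " model fits during cross-validation")
  }
}
```

      Two candidate values for each of four parameters give 16 combinations, and 10 folds make 160 fits before the best combination is refitted on the full training data, which explains why this step takes far longer than the single initial fit.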

      

  • Original article: https://www.cnblogs.com/BigJunOba/p/9358726.html