zoukankan      html  css  js  c++  java
  • SparkMLlib分类算法之支持向量机

    SparkMLlib分类算法之支持向量机

    (一),概念

      支持向量机(support vector machine)是一种分类算法,通过寻求结构化风险最小来提高学习机泛化能力,实现经验风险和置信范围的最小化,从而达到在统计样本量较少的情况下,亦能获得良好统计规律的目的。通俗来讲,它是一种二类分类模型,其基本模型定义为特征空间上的间隔最大的线性分类器,即支持向量机的学习策略便是间隔最大化,最终可转化为一个凸二次规划问题的求解。参考网址:http://www.cnblogs.com/end/p/3848740.html

    (二),SparkMLlib中SVM回归应用

    1,数据集:参考这篇SparkMLlib学习分类算法之逻辑回归算法

    2,处理数据及获取训练集和测试集

    val orig_file=sc.textFile("train_nohead.tsv")
        //println(orig_file.first())
        val data_file=orig_file.map(_.split("	")).map{
          r =>
            val trimmed =r.map(_.replace(""",""))
            val lable=trimmed(r.length-1).toDouble
            val feature=trimmed.slice(4,r.length-1).map(d => if(d=="?")0.0
            else d.toDouble)
            LabeledPoint(lable,Vectors.dense(feature))
        }
       /*特征标准化优化*/
        val vectors=data_file.map(x =>x.features)
        val rows=new RowMatrix(vectors)
        println(rows.computeColumnSummaryStatistics().variance)//每列的方差
        val scaler=new StandardScaler(withMean=true,withStd=true).fit(vectors)//标准化
        val scaled_data=data_file.map(point => LabeledPoint(point.label,scaler.transform(point.features)))
            .randomSplit(Array(0.7,0.3),11L)
        val data_train=scaled_data(0)
        val data_test=scaled_data(1)

    2,建立支持向量机模型及模型评估

     /*训练 SVM 模型**/
        val model_Svm=SVMWithSGD.train(data_train,numIteration)
    val correct_svm=data_test.map{
          point => if(model_Svm.predict(point.features)==point.label)
            1 else 0
        }.sum()/data_test.count()//精确度:0.6060885608856088
    val metrics=Seq(model_Svm).map{
          model =>
            val socreAndLabels=data_test.map {
              point => (model.predict(point.features), point.label)
            }
            val metrics=new BinaryClassificationMetrics(socreAndLabels)
            (model.getClass.getSimpleName,metrics.areaUnderPR(),metrics.areaUnderROC())
        }
    val allMetrics = metrics 
        allMetrics.foreach{ case (m, pr, roc) =>
          println(f"$m, Area under PR: ${pr * 100.0}%2.4f%%, Area under ROC: ${roc * 100.0}%2.4f%%")
        }
    /*
    SVMModel, Area under PR: 72.5527%, Area under ROC: 60.4180%*/

    3,模型参数调优

       逻辑回归(SGD)和 SVM 模型有相同的参数,原因是它们都使用随机梯度下降( SGD )作为基础优化技术。不同点在于二者采用的损失函数不同

    3.1 定义调参函数及模型评估函数

    /*调参函数*/
        def trainWithParams(input: RDD[LabeledPoint], regParam: Double,
                            numIterations: Int, updater: Updater, stepSize: Double) = {
          val svm = new SVMWithSGD
          svm.optimizer.setNumIterations(numIterations).
            setUpdater(updater).setRegParam(regParam).setStepSize(stepSize)
          svm.run(input)
        }
        /*评估函数*/
        def createMetrics(label: String, data: RDD[LabeledPoint], model:
        ClassificationModel) = {
          val scoreAndLabels = data.map { point =>
            (model.predict(point.features), point.label)
          }
          val metrics = new BinaryClassificationMetrics(scoreAndLabels)
          (label, metrics.areaUnderROC)
        }

    3.2  改变迭代次数(发现一旦完成特定次数的迭代,再增大迭代次数对结果的影响较小)

    val iterResults = Seq(1, 5, 10, 50).map { param =>
          val model = trainWithParams(data_train, 0.0, param, new
              SimpleUpdater, 1.0)
          createMetrics(s"$param iterations", data_test, model)
        }
        iterResults.foreach { case (param, auc) => println(f"$param, AUC = ${auc * 100}%2.2f%%") }
    /*
    1 iterations, AUC = 59.02%
    5 iterations, AUC = 60.04%
    10 iterations, AUC = 60.42%
    50 iterations, AUC = 60.42%
    */

    3.3 ,改变步长(以看出步长增长过大对性能有负面影响)

        在 SGD 中,在训练每个样本并更新模型的权重向量时,步长用来控制算法在最陡的梯度方向上应该前进多远。较大的步长收敛较快,但是步长太大可能导致收敛到局部最优解。

    val stepResults = Seq(0.001, 0.01, 0.1, 1.0, 10.0).map { param =>
          val model = trainWithParams(data_train, 0.0, numIteration, new
              SimpleUpdater, param)
          createMetrics(s"$param step size", data_test, model)
        }
        stepResults.foreach { case (param, auc) => println(f"$param, AUC = ${auc * 100}%2.2f%%") }
    /*
    0.001 step size, AUC = 59.02%
    0.01 step size, AUC = 59.02%
    0.1 step size, AUC = 59.01%
    1.0 step size, AUC = 60.42%
    10.0 step size, AUC = 56.09%
    
    */

    3.4 正则化 

    val regResults = Seq(0.001, 0.01, 0.1, 1.0, 10.0).map { param =>
          val model = trainWithParams(data_train, param, numIteration,
            new SquaredL2Updater, 1.0)
          createMetrics(s"$param L2 regularization parameter",
            data_test, model)
        }
        regResults.foreach { case (param, auc) => println(f"$param, AUC = ${auc * 100}%2.2f%%") }
    /*
    0.001 L2 regularization parameter, AUC = 60.42%
    0.01 L2 regularization parameter, AUC = 60.42%
    0.1 L2 regularization parameter, AUC = 60.37%
    1.0 L2 regularization parameter, AUC = 60.56%
    10.0 L2 regularization parameter, AUC = 41.54%
    */    
        可以看出,低等级的正则化对模型的性能影响不大。然而,增大正则化可以看到欠拟合会导致较低模型性能。
    (三),总结
        1,提高精确度感觉蛮难的,前提还是要先分析数据,对不同特征加以处理吧。。。。。
        2,以后多学习。。。。
  • 相关阅读:
    random模块学习笔记
    python3 控制结构知识及范例
    eclipse运行python 安装pydev 版本匹配问题
    接口自动化CSV文件生成超长随机字符串--java接口方法
    lucene 3.0 + 盘古分词 + 关键字高亮 + 分页的实现与demo
    Loading a Different jQuery Version for IE6-8
    选择排序和冒泡排序
    Bootstrap Tabs with AJAX snippet
    jquery.qrcode.js
    validator.w3.org for html5
  • 原文地址:https://www.cnblogs.com/ksWorld/p/6882591.html
Copyright © 2011-2022 走看看