zoukankan      html  css  js  c++  java
  • Spark机器学习系列之13: 支持向量机SVM

    Spark 优缺点分析

    以下翻译自Scikit。 
    The advantages of support vector machines are: 
    (1)Effective in high dimensional spaces.在高维空间表现良好。 
    (2)Still effective in cases where number of dimensions is greater than the number of samples.在数据维度大于样本点数时候,依然可以起作用 
    (3)Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.仅仅使用训练数据的一个子集(支持向量),因此是内存友好型的算法。 
    (4)Versatile: different Kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels.适应广,能解决多种情况下的分类问题,这是由于它支持不同类型的核函数,甚至支持自定义的核函数。 
    能灵活使用多种核函数的确使得SVM变得非常强大,而核函数相关的理论很深奥,初步的探讨可参考转载的一篇文章:http://blog.csdn.net/qq_34531825/article/details/52895621 
    The disadvantages of support vector machines include: 
    (1)If the number of features is much greater than the number of samples, the method is likely to give poor performances.在数据维度(特征个数)多于样本数很多的时候,通常只能训练出一个表现很差的模型。 
    (2)SVMs do not directly provide probability estimates, these are calculated using an expensive five-fold cross-validation (see Scores and probabilities, below).SVM不支持直接进行概率估计,Scikit中使用很耗费资源的5折交叉检验来估计概率。

    Spark Mllib

    这里写图片描述 
    这里写图片描述 
    这里写图片描述

    import org.apache.spark.{SparkConf,SparkContext}
    import org.apache.log4j.{Level, Logger}
    import org.apache.spark.mllib.classification.{SVMModel,SVMWithSGD}
    import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
    import org.apache.spark.mllib.util.MLUtils
    import org.apache.spark.SparkConf
    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.optimization.L1Updater
    import org.apache.spark.mllib.optimization.SquaredL2Updater
    
    object mySVM {
      def main(args:Array[String]){
        //屏蔽日志
        Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
        Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF) 
        val conf=new SparkConf().setMaster("local").setAppName("My App")
        val sc=new SparkContext(conf)
    
        // Load training data in LIBSVM format.
        val data = MLUtils.loadLibSVMFile(sc, "/data/mllib/sample_libsvm_data.txt")
        //println(data.collect()(0))//检查数据   
    
        // Split data into training (60%) and test (40%).
        val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
        val training = splits(0).cache()
        val test = splits(1)
    
        // Run training algorithm to build the model   
        /*
         * stepSize: 迭代步长,默认为1.0
         * numIterations: 迭代次数,默认为100
         * regParam: 正则化参数,默认值为0.0
         * miniBatchFraction: 每次迭代参与计算的样本比例,默认为1.0
         * gradient:HingeGradient (),梯度下降;
         * updater:SquaredL2Updater (),正则化,L2范数;
         * optimizer:GradientDescent (gradient, updater),梯度下降最优化计算。
         */
        val svmAlg=new SVMWithSGD()
        svmAlg.optimizer
          .setNumIterations(100)
          .setRegParam(0.1)//正则化参数
          .setUpdater(new L1Updater)
        val modelL1=svmAlg.run(training)
    
    
        // Clear the default threshold.
        modelL1.clearThreshold()
    
       // Compute raw scores on the test set.
        val scoreAndLabels = test.map { point =>
         val score = modelL1.predict(point.features)     
         (score, point.label)//return score and label
        }   
    
       // Get evaluation metrics.
       val metrics = new BinaryClassificationMetrics(scoreAndLabels)
       val auROC = metrics.areaUnderROC() 
       println("Area under ROC = " + auROC)  
    
      }   
    
    }

    参考文献 
    (1) 支持向量机通俗导论(理解SVM的三层境界) http://blog.csdn.net/v_july_v/article/details/7624837 
    (2)Spark MLlib SVM算法(源代码分析)http://www.itnose.net/detail/6267193.html 
    (3)Spark官网与Scikit官网 
    (4)LIBSVM: A  Library是 for Support Vector Machines , Chih-Chung Chang and Chih-Jen Lin,Department of Computer Science National Taiwan University, Taipei, Taiwan http://www.csie.ntu.edu.tw/~cjlin/papers/libsvm.pdf 
    (5)A Tutorial on Support Vector Regression Alex J. Smola† and Bernhard Scholkopf 
    (6)如何解决机器学习中数据不平衡问题 http://www.zhaokv.com/2016/01/learning-from-imbalanced-data.html 
    (7)不平衡数据下的机器学习方法简介 http://www.jianshu.com/p/3e8b9f2764c8 
    (8)核函数 http://blog.csdn.net/xiaojiegege123456/article/details/7728198,xiaojiegege123456 CSDN 对核函数有一个比较容易的理解的介绍 
    (9)http://www.cnblogs.com/zxb1990/p/5098345.html SVM学习笔记(二)—-手写数字识别 贴近实际应用一些 
    (10)http://www.cnblogs.com/jerrylead/archive/2011/03/18/1988406.html 
    (11)RBF 参数 http://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html

  • 相关阅读:
    Kubernetes 集成研发笔记
    Rust 1.44.0 发布
    Rust 1.43.0 发布
    PAT 甲级 1108 Finding Average (20分)
    PAT 甲级 1107 Social Clusters (30分)(并查集)
    PAT 甲级 1106 Lowest Price in Supply Chain (25分) (bfs)
    PAT 甲级 1105 Spiral Matrix (25分)(螺旋矩阵,简单模拟)
    PAT 甲级 1104 Sum of Number Segments (20分)(有坑,int *int 可能会溢出)
    java 多线程 26 : 线程池
    OpenCV_Python —— (4)形态学操作
  • 原文地址:https://www.cnblogs.com/itboys/p/8410566.html
Copyright © 2011-2022 走看看