  • Spark model selection and hyperparameter tuning

    Spark - ML Tuning

    Official documentation: https://spark.apache.org/docs/2.2.0/ml-tuning.html

    This section covers how to tune model algorithms and pipelines using MLlib's tooling. The built-in cross-validation and other tools let users optimize the hyperparameters of a model or of an entire pipeline.

    Contents:

    • Model selection, i.e. hyperparameter tuning;
    • Cross-validation;
    • Train/validation split;

    Model selection (hyperparameter tuning)

    An important task in machine learning is model selection: using data to find the best model and parameters for a given task, also known as tuning. Tuning can target a single Estimator or an entire Pipeline; users can tune a whole Pipeline at once rather than tuning each of its stages separately.

    MLlib supports model selection tools such as CrossValidator and TrainValidationSplit. These tools require the following items:

    • Estimator: the algorithm or Pipeline to tune;
    • A set of ParamMaps: the parameter space to search over;
    • Evaluator: a metric for measuring how well a fitted model performs on held-out test data;

    At a high level, these tools work as follows:

    • they split the input data into separate training and test sets;
    • for each (training, test) pair, they iterate over the parameter combinations in the search space:
      • for each combination, they fit the Estimator with those parameters, obtain the corresponding model, and evaluate the model's performance;
    • they select the parameter combination that produced the best-performing model (see the sketch below).
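
    A minimal sketch of that selection loop as a plain Python function (an illustration, not Spark's actual implementation; `splits` is assumed to be a list of (train, test) DataFrame pairs and `param_maps` a list of ParamMaps such as ParamGridBuilder produces):

    def select_best_params(estimator, evaluator, param_maps, splits):
        """Grid search: average the evaluation metric over all splits for each ParamMap."""
        best_metric, best_params = float("-inf"), None
        for params in param_maps:
            # Fit with this parameter combination on each training set,
            # then score the fitted model on the matching test set.
            metrics = [evaluator.evaluate(estimator.fit(train, params).transform(test))
                       for train, test in splits]
            avg = sum(metrics) / len(metrics)
            if avg > best_metric:  # assumes a higher-is-better metric
                best_metric, best_params = avg, params
        return best_params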

    The Evaluator can be a RegressionEvaluator for regression problems, a BinaryClassificationEvaluator for binary data, or a MulticlassClassificationEvaluator for multiclass problems. The default metric used by each evaluator can be changed via setMetricName.
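
    For instance, a small sketch (these metric names are ones the evaluators support; the defaults are areaUnderROC, f1, and rmse respectively):

    from pyspark.ml.evaluation import (BinaryClassificationEvaluator,
                                       MulticlassClassificationEvaluator,
                                       RegressionEvaluator)

    # Switch a binary evaluator from the default areaUnderROC to areaUnderPR.
    bin_eval = BinaryClassificationEvaluator().setMetricName("areaUnderPR")
    # Switch a multiclass evaluator from the default f1 to accuracy.
    mc_eval = MulticlassClassificationEvaluator().setMetricName("accuracy")
    # Switch a regression evaluator from the default rmse to mae.
    reg_eval = RegressionEvaluator().setMetricName("mae")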

    Cross-validation

    CrossValidator begins by splitting the dataset into a set of folds, which serve as separate training and test sets. With k=3, CrossValidator generates 3 (training, test) pairs from the 3 folds, each pair using 2 folds for training and the remaining fold for testing. To evaluate a given parameter combination, CrossValidator computes the average performance of the 3 models, each trained on one of those (training, test) pairs.

    Once the best parameter combination is identified, CrossValidator re-fits on the entire dataset using those best parameters to produce the final model.

    Example: model selection via cross-validation.

    Note that cross-validation over a grid of parameters is very expensive. In the example below, numFeatures takes 3 values and regParam takes 2, and CrossValidator uses 2 folds, so 3 × 2 × 2 = 12 different models are trained. In real workflows it is common to try far more parameters, values, and folds; in other words, CrossValidator is inherently costly. Even so, compared with hand-tuning it remains a more principled and automated way to choose parameters.

    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.feature import HashingTF, Tokenizer
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
    from pyspark.sql import SparkSession
    
    # Create the SparkSession (the original example assumes `spark` already exists).
    spark = SparkSession.builder.appName("MLTuningExample").getOrCreate()
    
    # Prepare training documents, which are labeled.
    training = spark.createDataFrame([
        (0, "a b c d e spark", 1.0),
        (1, "b d", 0.0),
        (2, "spark f g h", 1.0),
        (3, "hadoop mapreduce", 0.0),
        (4, "b spark who", 1.0),
        (5, "g d a y", 0.0),
        (6, "spark fly", 1.0),
        (7, "was mapreduce", 0.0),
        (8, "e spark program", 1.0),
        (9, "a e c l", 0.0),
        (10, "spark compile", 1.0),
        (11, "hadoop software", 0.0)
    ], ["id", "text", "label"])
    
    # Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
    tokenizer = Tokenizer(inputCol="text", outputCol="words")
    hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
    lr = LogisticRegression(maxIter=10)
    pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
    
    # We now treat the Pipeline as an Estimator, wrapping it in a CrossValidator instance.
    # This will allow us to jointly choose parameters for all Pipeline stages.
    # A CrossValidator requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.
    # We use a ParamGridBuilder to construct a grid of parameters to search over.
    # With 3 values for hashingTF.numFeatures and 2 values for lr.regParam,
    # this grid will have 3 x 2 = 6 parameter settings for CrossValidator to choose from.
    paramGrid = ParamGridBuilder() \
        .addGrid(hashingTF.numFeatures, [10, 100, 1000]) \
        .addGrid(lr.regParam, [0.1, 0.01]) \
        .build()
    
    crossval = CrossValidator(estimator=pipeline,
                              estimatorParamMaps=paramGrid,
                              evaluator=BinaryClassificationEvaluator(),
                              numFolds=2)  # use 3+ folds in practice
    
    # Run cross-validation, and choose the best set of parameters.
    cvModel = crossval.fit(training)
    
    # Prepare test documents, which are unlabeled.
    test = spark.createDataFrame([
        (4, "spark i j k"),
        (5, "l m n"),
        (6, "mapreduce spark"),
        (7, "apache hadoop")
    ], ["id", "text"])
    
    # Make predictions on test documents. cvModel uses the best model found (lrModel).
    prediction = cvModel.transform(test)
    selected = prediction.select("id", "text", "probability", "prediction")
    for row in selected.collect():
        print(row)
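
    After fitting, the returned CrossValidatorModel can be inspected to see which combination won (a short sketch continuing the example above):

    # avgMetrics holds the cross-validated metric for each ParamMap,
    # in the same order as the grid built above.
    print(cvModel.avgMetrics)
    # bestModel is the PipelineModel refit on all the data with the winning
    # parameters; its last stage is the fitted LogisticRegressionModel.
    bestLRModel = cvModel.bestModel.stages[-1]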
    

    Train/validation split

    For hyperparameter tuning Spark also supports TrainValidationSplit, which evaluates each parameter combination only once, unlike CrossValidator's k evaluations. It is therefore much cheaper, but it will not produce as reliable a result when the training dataset is not sufficiently large.

    Unlike CrossValidator, TrainValidationSplit creates a single (training, validation) pair, splitting the data into two parts according to trainRatio; with trainRatio=0.75, 75% of the data is used for training and 25% for validation.

    Like CrossValidator, TrainValidationSplit ultimately trains a final predictor using the best parameters and the entire dataset.

    from pyspark.ml.evaluation import RegressionEvaluator
    from pyspark.ml.regression import LinearRegression
    from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit
    
    # Prepare training and test data.
    # Reuses the SparkSession `spark` created in the previous example.
    data = spark.read.format("libsvm") \
        .load("data/mllib/sample_linear_regression_data.txt")
    train, test = data.randomSplit([0.9, 0.1], seed=12345)
    
    lr = LinearRegression(maxIter=10)
    
    # We use a ParamGridBuilder to construct a grid of parameters to search over.
    # TrainValidationSplit will try all combinations of values and determine best model using
    # the evaluator.
    paramGrid = ParamGridBuilder() \
        .addGrid(lr.regParam, [0.1, 0.01]) \
        .addGrid(lr.fitIntercept, [False, True]) \
        .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0]) \
        .build()
    
    # In this case the estimator is simply the linear regression.
    # A TrainValidationSplit requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.
    tvs = TrainValidationSplit(estimator=lr,
                               estimatorParamMaps=paramGrid,
                               evaluator=RegressionEvaluator(),
                               # 80% of the data will be used for training, 20% for validation.
                               trainRatio=0.8)
    
    # Run TrainValidationSplit, and choose the best set of parameters.
    model = tvs.fit(train)
    
    # Make predictions on test data. model is the model with combination of parameters
    # that performed best.
    model.transform(test) \
        .select("features", "label", "prediction") \
        .show()
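
    Similarly, the fitted TrainValidationSplitModel exposes the per-ParamMap validation metrics and the winning model (a short sketch continuing the example above):

    # validationMetrics lines up one-to-one with estimatorParamMaps.
    print(model.validationMetrics)
    # bestModel is the LinearRegressionModel refit with the best parameters.
    print(model.bestModel.coefficients, model.bestModel.intercept)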
    