  • The Pipeline approach in Spark MLlib

    A Spark MLlib pipeline chains several machine learning algorithms into a single workflow and runs them one after another.

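    Each algorithm in a Pipeline is called a PipelineStage. There are two kinds of PipelineStage, Estimator and Transformer:
    • A Transformer converts the data into another form (for example, by changing its format) so that later stages such as an Estimator can consume it. All Transformers share the same conversion method, transform.
    • An Estimator learns a Model from the data (a Model is itself a subclass of Transformer). All Estimators share the same training method, fit.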
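To make the two stage kinds concrete, here is a minimal sketch against the org.apache.spark.ml API. It assumes Spark 2.0 or later (it uses SparkSession, which is newer than the source excerpt quoted in this post) and a local master; the object name and the two-row training set are made up for illustration.

```scala
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

// Hypothetical example object; not part of Spark.
object StageKindsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")          // assumption: run locally
      .appName("stage-kinds")
      .getOrCreate()
    import spark.implicits._

    // Tiny made-up training set: (text, label).
    val training = Seq(
      ("spark pipeline stage", 1.0),
      ("unrelated words here", 0.0)
    ).toDF("text", "label")

    // Transformers: transform() maps a DataFrame to a new DataFrame.
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val features = hashingTF.transform(tokenizer.transform(training))

    // Estimator: fit() learns from the data and returns a Model.
    val lr = new LogisticRegression().setMaxIter(10)
    val model: LogisticRegressionModel = lr.fit(features)

    // The Model is itself a Transformer, so it also exposes transform().
    model.transform(features).select("text", "prediction").show()

    spark.stop()
  }
}
```

Here the Tokenizer and HashingTF are only ever asked to transform, while the LogisticRegression is only ever asked to fit; the object that fit returns is what produces predictions.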

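    A composite algorithm can then be packaged as a single pipeline. The benefit is that individual algorithms become easy to swap. In an application we usually just want a generic capability such as "classification" or "fitting (regression)", without knowing in advance which particular algorithm will perform best. With the pipeline mechanism it is straightforward to try different classification or fitting algorithms in the same workflow and keep the one that gives the best result.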
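The swap described above might look roughly like the following sketch. The helper buildPipeline and the object name are hypothetical, as are the column names; the stage classes themselves are from org.apache.spark.ml. The feature stages stay fixed and only the final stage changes.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.{DecisionTreeClassifier, LogisticRegression}
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Hypothetical example object; not part of Spark.
object SwapClassifierSketch {
  // Hypothetical helper: identical feature stages, pluggable final stage.
  def buildPipeline(classifier: PipelineStage): Pipeline = {
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    new Pipeline().setStages(Array[PipelineStage](tokenizer, hashingTF, classifier))
  }

  // Two candidate pipelines that differ only in the classifier.
  val withLogReg: Pipeline = buildPipeline(new LogisticRegression().setMaxIter(10))
  val withTree: Pipeline = buildPipeline(new DecisionTreeClassifier())
}
```

Each candidate can then be fitted on the same training data and compared on held-out data; that evaluation step is left out of the sketch.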

    /**
    * :: Experimental ::
    * A simple pipeline, which acts as an estimator. A Pipeline consists of a sequence of stages, each
    * of which is either an [[Estimator]] or a [[Transformer]]. When [[Pipeline#fit]] is called, the
    * stages are executed in order. If a stage is an [[Estimator]], its [[Estimator#fit]] method will
    * be called on the input dataset to fit a model. Then the model, which is a transformer, will be
    * used to transform the dataset as the input to the next stage. If a stage is a [[Transformer]],
    * its [[Transformer#transform]] method will be called to produce the dataset for the next stage.
    * The fitted model from a [[Pipeline]] is an [[PipelineModel]], which consists of fitted models and
    * transformers, corresponding to the pipeline stages. If there are no stages, the pipeline acts as
    * an identity transformer.
    */
    @Experimental
    class Pipeline(override val uid: String) extends Estimator[PipelineModel] {
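The fitting procedure described in the Scaladoc above can be modeled with a small toy in plain Scala. This is not Spark's implementation: every name below (ToyPipeline, Data, Scale, MeanCenter, Shift) is a hypothetical stand-in, and a "dataset" is just a Vector[Double]. For simplicity the toy also transforms the data after the last stage, which is not needed to build the model.

```scala
// Toy stand-ins for the Spark types; all names here are hypothetical.
object ToyPipeline {
  type Data = Vector[Double]

  sealed trait Stage
  trait Transformer extends Stage { def transform(data: Data): Data }
  trait Estimator extends Stage { def fit(data: Data): Transformer }

  // Fitted pipeline: only transformers, applied in order.
  // With no stages, foldLeft returns the input unchanged (identity).
  final case class PipelineModel(stages: Vector[Transformer]) extends Transformer {
    def transform(data: Data): Data =
      stages.foldLeft(data)((current, stage) => stage.transform(current))
  }

  final case class Pipeline(stages: Vector[Stage]) extends Estimator {
    def fit(data: Data): PipelineModel = {
      val (_, fitted) = stages.foldLeft((data, Vector.empty[Transformer])) {
        case ((current, acc), stage) =>
          // An Estimator is fitted on the current data; a Transformer is used as-is.
          val transformer = stage match {
            case e: Estimator   => e.fit(current)
            case t: Transformer => t
          }
          // The transformed data becomes the input of the next stage.
          (transformer.transform(current), acc :+ transformer)
      }
      PipelineModel(fitted)
    }
  }

  // Example stages: multiply by a constant; subtract the mean learned in fit().
  final case class Scale(factor: Double) extends Transformer {
    def transform(data: Data): Data = data.map(_ * factor)
  }
  final case class Shift(offset: Double) extends Transformer {
    def transform(data: Data): Data = data.map(_ + offset)
  }
  object MeanCenter extends Estimator {
    def fit(data: Data): Transformer = Shift(-(data.sum / data.size))
  }
}
```

For example, `ToyPipeline.Pipeline(Vector(ToyPipeline.Scale(2.0), ToyPipeline.MeanCenter)).fit(Vector(1.0, 2.0, 3.0))` fits MeanCenter on the already scaled data, so the resulting model holds `Scale(2.0)` followed by `Shift(-4.0)`.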





  • Original post: https://www.cnblogs.com/zwCHAN/p/4633753.html