  • Mahout classification

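    Classification looks like considerably more work than clustering or recommendation.

    Classification differs from clustering and recommendation in three ways: it must have a well-defined outcome, it must be supervised, and it is used mainly for prediction and detection.

    Mahout's advantage: the resource requirements of its classification algorithms grow no faster than the volume of training and test data, and the algorithms can be converted to run as distributed applications. (If the data set is not large enough, Mahout may perform worse than other kinds of systems.)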

    Glossary of key terms:

    Key idea : Description

    Model : A computer program that makes decisions; in classification, the output of the training algorithm is a model.

    Training data : Subset of training examples labeled with the value of the target variable and used as input to the learning algorithm to produce the model.

    Test data : Withheld portion of training examples given to the model without the value for the target variable (although the value is known) and used to evaluate the model.

    Training : Learning process that uses training data to produce a model. That model can then compute estimates of the target variable given the predictor variables as inputs.

    Training example : Entity with features that will be used as input for the learning algorithm.

    Feature : A known characteristic of a training or new example; “feature” is equivalent to “characteristic”.

    Variable : In this context, a variable is equivalent to the value of a feature or a function of several features. This usage is somewhat different from a normal variable in a computer program.

    Record : A container where an example is stored; such a record is composed of fields.

    Field : Part of a record that contains the value of a feature (variable).

    Predictor variable : Feature selected for use as input for a classification model. Not all features need be used. Some features may be algorithmic combinations of other features.

    Target variable : Feature that the classification model is attempting to estimate. The target variable is categorical, and its determination is the aim of the classification system.

    As a rule of thumb, 80-90% of the data is used as training data and the remainder as test data.
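    The 80-90% rule of thumb can be sketched as a shuffled split. This is only an illustration: the function name, the 0.85 default, the seed, and the 40 placeholder records are all assumptions, and Mahout itself is not involved.

```python
import random


def split_examples(examples, train_fraction=0.85, seed=42):
    """Shuffle a copy of the examples and split it into (training, test)."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    shuffled = list(examples)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical data set: 40 records identified only by their index.
records = list(range(40))
training, test = split_examples(records)
print(len(training), len(test))
```

    Shuffling before cutting matters when the file is sorted (for example by date or by class); otherwise the test data may not resemble the training data.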

     

    The four types of values in Mahout classification

     

    Type of value : Description

    Continuous : A floating-point value. This might be a price, a weight, a time, or anything else that has a numerical magnitude and where this magnitude is the key property of the value.

    Categorical : A value that can have one of a pre-specified set of values. Typically the set of categorical values is relatively small and may be as small as two, although the set can be quite large. Boolean values are generally treated as categorical values. Another example might be a vendor ID.

    Word-like : Like a categorical value, but with an open-ended set of possible values.

    Text-like : A sequence of word-like values, all of the same kind. Text is the classic example of a text-like value, but a list of email addresses or URLs is also text-like.

     

    Data types of an example record

    Name               Type                       Value
    from-address       word-like                  George <george@fumble-tech.com>
    in-address-book?   categorical (TRUE, FALSE)  TRUE
    non-spam-words     text-like                  “Ted”, “Mahout”, “User”, “lunch”
    spam-words         text-like                  “available”
    unknown-words      continuous                 0
    message-length     continuous                 31
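    To make the four value types concrete, here is one possible way to turn the example record into a fixed-size numeric vector using feature hashing. This is a sketch under assumptions: the source does not describe the encoding, so `encode`, the CRC32-based slot choice, and the vector size of 20 are illustrative and should not be read as Mahout's implementation.

```python
import zlib


def encode(record, types, num_features=20):
    """Hash every field of a record into a vector of num_features slots."""
    vector = [0.0] * num_features

    def slot(key):
        # Deterministic hash of the key, folded into the vector size.
        return zlib.crc32(key.encode("utf-8")) % num_features

    for name, value in record.items():
        kind = types[name]
        if kind == "continuous":
            # The magnitude itself is the feature value.
            vector[slot(name)] += float(value)
        elif kind in ("categorical", "word-like"):
            # One slot per (field, value) pair.
            vector[slot(f"{name}={value}")] += 1.0
        elif kind == "text-like":
            # A sequence of word-like values: one increment per element.
            for word in value:
                vector[slot(f"{name}={word}")] += 1.0
        else:
            raise ValueError(f"unknown value type: {kind}")
    return vector


types = {
    "from-address": "word-like",
    "in-address-book?": "categorical",
    "non-spam-words": "text-like",
    "spam-words": "text-like",
    "unknown-words": "continuous",
    "message-length": "continuous",
}
record = {
    "from-address": "George <george@fumble-tech.com>",
    "in-address-book?": "TRUE",
    "non-spam-words": ["Ted", "Mahout", "User", "lunch"],
    "spam-words": ["available"],
    "unknown-words": 0,
    "message-length": 31,
}
vector = encode(record, types)
print(vector)
```

    Word-like and text-like fields have an open-ended vocabulary, which is why a fixed-size hashed vector is convenient here: new words need no new columns, at the cost of occasional collisions.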

     

    Steps in applying classification

    Stage : Steps

    1. Training the model
        • Define target variable
        • Collect historical data
        • Define predictor variables
        • Select a learning algorithm
        • Use learning algorithm to train model

    2. Evaluating the model
        • Run test data
        • Adjust input (different predictor variables and/or algorithm)

    3. Using model in production
        • Input the new examples to estimate unknown target values
        • Retrain model as needed

     

    Mahout's command-line tools

    $ $MAHOUT_HOME/bin/mahout
    An example program must be given as the first argument.
    Valid program names are:
    canopy: : Canopy clustering
    cat : Print a file or resource as the logistic regression models would see it
    ...
    runlogistic : Run a logistic regression model against CSV data
    ...
    trainlogistic : Train a logistic regression using stochastic gradient descent

     

    Demo

    cat: view a file

    $ bin/mahout cat donut.csv
    "x","y","shape","color","k","k0","xx","xy","yy","a","b","c","bias"
    0.923307513352484,0.0135197141207755,21,2,4,8,0.852496764213146,...,1
    0.711011884035543,0.909141522599384,22,2,3,9,0.505537899239772,...,1
    ...
    0.67132937326096,0.571220482233912,23,1,5,2,0.450683127402953,...,1
    0.548616112209857,0.405350996181369,24,1,5,3,0.300979638576258,...,1
    0.677980388281867,0.993355110753328,25,2,3,9,0.459657406894831,...,1
    $

     

    trainlogistic: train a model from the data

    $ $MAHOUT_HOME/bin/mahout trainlogistic --input donut.csv \
        --output ./model \
        --target color --categories 2 \
        --predictors x y --types numeric \
        --features 20 --passes 100 --rate 50
    ...
    color ~ -0.157*Intercept Term + -0.678*x + -0.416*y
    Intercept Term -0.15655
    x -0.67841
    y -0.41587
    ...
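    The printed formula is a linear combination of the predictors. Assuming the usual logistic (sigmoid) link, the coefficients above could be turned into a score between 0 and 1 as follows. The function name is hypothetical, and which of the two color categories the score refers to is not stated in the output, so treat this as a sketch only.

```python
import math

# Coefficients copied from the trainlogistic output above.
COEFFICIENTS = {"Intercept Term": -0.15655, "x": -0.67841, "y": -0.41587}


def logistic_score(x, y, coef=COEFFICIENTS):
    """Apply the linear model, then squash the result into (0, 1)."""
    z = coef["Intercept Term"] + coef["x"] * x + coef["y"] * y
    return 1.0 / (1.0 + math.exp(-z))


# First data row of donut.csv as shown by `mahout cat`.
score = logistic_score(0.923307513352484, 0.0135197141207755)
print(f"{score:.3f}")
```

    Because both x and y carry negative weights, larger values of either predictor push the score down.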

    Option : What it does

    --quiet : Produce less status and progress output.

    --input <file-or-resource> : Use the specified file or resource as input.

    --output <file-for-model> : Put the model into the specified file.

    --target <variable> : Use the specified variable as the target.

    --categories <n> : The number of categories the target variable has.

    --predictors <v1> ... <vn> : A list of the names of the predictor variables.

    --types <t1> ... <tm> : A list of the types of the predictor variables. Each type should be one of numeric, word, or text. Types can be abbreviated to their first letter. If too few types are given, the last one is used again as necessary. Use word for categorical variables.

    --passes : The number of times the input data should be re-examined during training. Small input files may need to be examined dozens of times. Very large input files probably don’t even need to be completely examined.

    --lambda : Controls how much the algorithm tries to eliminate variables from the final model. A value of 0 indicates no effort is made. Typical values are on the order of 0.00001 or less.

    --rate : The initial learning rate. This can be large if you have lots of data or use lots of passes, because it is decreased progressively as data is examined.

    --noBias : Do not use the built-in constant in the model (this eliminates the Intercept Term from the model). Occasionally this is a good idea, but generally it is not, since the SGD learning algorithm can usually eliminate the intercept term if warranted.

    --features : The size of the internal feature vector to use in building the model. A larger value here can be helpful, especially with text-like input data.
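    How --passes, --rate, and --lambda interact can be illustrated with a toy stochastic gradient descent loop. This is a sketch under assumptions, not Mahout's trainer: the per-pass decay `rate / (1 + pass_index)`, the L1-style shrinkage used for `lam`, and the toy data are all chosen here for illustration.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_sgd(rows, passes=100, rate=1.0, lam=0.0, seed=1):
    """rows: list of (features, label) pairs with label 0 or 1."""
    rng = random.Random(seed)
    weights = [0.0] * len(rows[0][0])
    for pass_index in range(passes):
        # Assumed schedule: the rate shrinks as more data has been examined.
        eta = rate / (1.0 + pass_index)
        order = list(rows)
        rng.shuffle(order)
        for features, label in order:
            p = sigmoid(sum(w * f for w, f in zip(weights, features)))
            for i, f in enumerate(features):
                w = weights[i] + eta * (label - p) * f
                # Assumed regularizer: pull each weight toward zero by eta * lam.
                shrink = eta * lam
                weights[i] = math.copysign(max(0.0, abs(w) - shrink), w)
    return weights


# Toy data: a constant bias feature plus one predictor; label is 1 when x > 0.5.
rows = [([1.0, i / 10.0], 1 if i > 5 else 0) for i in range(11)]
weights = train_sgd(rows)
low = sigmoid(weights[0] + weights[1] * 0.1)
high = sigmoid(weights[0] + weights[1] * 0.9)
print(weights, low, high)
```

    With `lam` set to a positive value, weights that receive little support from the data are driven to exactly zero, which is the sense in which a variable is "eliminated" from the model.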

     

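    runlogistic: evaluate the model

    $ bin/mahout runlogistic --input donut.csv --model ./model \
        --auc --confusion
    AUC = 0.57
    confusion: [[27.0, 13.0], [0.0, 0.0]]

    The --auc and --confusion options report classification quality. AUC is computed after the data has been read, and the closer it is to 1 the better; an AUC of 0.5 corresponds to random guessing, so 0.57 indicates a weak model. The confusion matrix counts correct and incorrect classifications.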
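    Both measures can be computed from a list of scores and known labels. The sketch below uses made-up scores and an assumed matrix layout of `matrix[predicted][actual]`; the source does not say how Mahout orders the rows and columns of its confusion matrix.

```python
def auc(scores, labels):
    """Chance that a random positive example outscores a random negative one.

    Ties count as half a win.
    """
    positives = [s for s, label in zip(scores, labels) if label == 1]
    negatives = [s for s, label in zip(scores, labels) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one example of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def confusion(scores, labels, threshold=0.5):
    """2x2 count matrix, indexed as matrix[predicted][actual] (assumed layout)."""
    matrix = [[0, 0], [0, 0]]
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        matrix[predicted][label] += 1
    return matrix


# Hypothetical scores and true labels for four examples.
scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
print(auc(scores, labels))        # 0.75
print(confusion(scores, labels))  # [[2, 1], [0, 1]]
```

    AUC does not depend on a threshold, whereas the confusion matrix changes whenever the threshold does; lowering it to 0.3 in this example moves two more examples into the predicted-1 row.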

     

    Options

    Option : What it does

    --quiet : Produce less status and progress output.

    --auc : Print out AUC score for model versus input data after reading data.

    --scores : Print target variable value and scores for each input example.

    --threshold <t> : Set the threshold for confusion matrix computation to t (default 0.5).

    --confusion : Print out confusion matrix for a particular threshold (see --threshold).

    --input <input> : Read data records from specified file or resource.

    --model <model> : Read model from specified file.


  • Original article: https://www.cnblogs.com/batys/p/3295942.html