  • With our powers combined! xgboost and pipelearner

    @drsimonj here to show you how to use xgboost (extreme gradient boosting) models in pipelearner.

     Why a post on xgboost and pipelearner?

    xgboost is one of the most powerful machine-learning libraries, so there’s a good reason to use it. pipelearner helps to create machine-learning pipelines that make it easy to do cross-validation, hyperparameter grid searching, and more. Bringing them together makes for an awesome combination!

    The only problem - out of the box, xgboost doesn’t play nice with pipelearner. Let’s work out how to deal with this.

     Setup

    To follow this post you’ll need the following packages:

    # Install (if necessary)
    install.packages(c("xgboost", "tidyverse", "devtools", "lazyeval"))
    devtools::install_github("drsimonj/pipelearner")
    
    # Attach
    library(tidyverse)
    library(xgboost)
    library(pipelearner)
    library(lazyeval)
    

    Our example will be to try and predict whether tumours are cancerous or not using the Breast Cancer Wisconsin (Diagnostic) Data Set. Set up as follows:

    data_url <- 'https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data'
    
    d <- read_csv(
      data_url,
      col_names = c('id', 'thickness', 'size_uniformity',
                    'shape_uniformity', 'adhesion', 'epith_size',
                    'nuclei', 'chromatin', 'nucleoli', 'mitoses', 'cancer')) %>% 
      select(-id) %>%            # Remove id; not useful here
      filter(nuclei != '?') %>%  # Remove records with missing data
      mutate(cancer = cancer == 4) %>% # binary-encode 'cancer' as 1=malignant;0=benign
      mutate_all(as.numeric)     # All to numeric; needed for XGBoost
    
    d
    #> # A tibble: 683 × 10
    #>    thickness size_uniformity shape_uniformity adhesion epith_size nuclei
    #>        <dbl>           <dbl>            <dbl>    <dbl>      <dbl>  <dbl>
    #> 1          5               1                1        1          2      1
    #> 2          5               4                4        5          7     10
    #> 3          3               1                1        1          2      2
    #> 4          6               8                8        1          3      4
    #> 5          4               1                1        3          2      1
    #> 6          8              10               10        8          7     10
    #> 7          1               1                1        1          2     10
    #> 8          2               1                2        1          2      1
    #> 9          2               1                1        1          2      1
    #> 10         4               2                1        1          2      1
    #> # ... with 673 more rows, and 4 more variables: chromatin <dbl>,
    #> #   nucleoli <dbl>, mitoses <dbl>, cancer <dbl>
    

     pipelearner

    pipelearner makes it easy to do lots of routine machine learning tasks, many of which you can check out in this post. For this example, we’ll use pipelearner to perform a grid search of some xgboost hyperparameters.

    Grid searching is easy with pipelearner. For detailed instructions, check out my previous post: tidy grid search with pipelearner. As a quick reminder, we declare a data frame, machine learning function, formula, and hyperparameters as vectors. Here’s an example that would grid search multiple values of minsplit and maxdepth for an rpart decision tree:

    pipelearner(d, rpart::rpart, cancer ~ .,
                minsplit = c(2, 4, 6, 8, 10),
                maxdepth = c(2, 3, 4, 5))
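
    To get a feel for how many models such a call implies, the grid can be enumerated with base R's expand.grid(). This is only an illustration of the combinations; it makes no claim about how pipelearner builds its grid internally.

```r
# Illustrative only: enumerate the hyperparameter combinations in the grid
# using base R. This does not call pipelearner.
grid <- expand.grid(
  minsplit = c(2, 4, 6, 8, 10),
  maxdepth = c(2, 3, 4, 5)
)

nrow(grid)     # one row per combination: 5 * 4 = 20
head(grid, 3)  # peek at the first few combinations
```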
    

    The challenge for xgboost:

    pipelearner expects a model function that has two arguments: data and formula

     xgboost

    Here’s an xgboost model:

    # Prep data (X) and labels (y)
    X <- select(d, -cancer) %>% as.matrix()
    y <- d$cancer
    
    # Fit the model
    fit <- xgboost(X, y, nrounds = 5, objective = "reg:logistic")
    #> [1]  train-rmse:0.372184 
    #> [2]  train-rmse:0.288560 
    #> [3]  train-rmse:0.230171 
    #> [4]  train-rmse:0.188965 
    #> [5]  train-rmse:0.158858
    
    # Examine accuracy
    predicted <- as.numeric(predict(fit, X) >= .5)
    mean(predicted == y)
    #> [1] 0.9838946
    

    Looks like we have a model with 98.39% accuracy on the training data!

    Regardless, notice that the first two arguments to xgboost() are a numeric data matrix and a numeric label vector. This is not what pipelearner wants!

     Wrapper function to parse data and formula

    To make xgboost compatible with pipelearner we need to write a wrapper function that accepts data and formula, and uses these to pass a feature matrix and label vector to xgboost:

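    pl_xgboost <- function(data, formula, ...) {
      # pipelearner passes resample objects, so convert to a data frame first
      data <- as.data.frame(data)
    
      # all.vars() returns only variable names, even for multi-term formulas
      X_names <- all.vars(f_rhs(formula))
      y_name  <- all.vars(f_lhs(formula))
    
      # Expand `.` to every column except the target
      if (identical(X_names, ".")) {
        X_names <- setdiff(names(data), y_name)
      }
    
      X <- data.matrix(data[, X_names, drop = FALSE])
      y <- data[[y_name]]
    
      xgboost(data = X, label = y, ...)
    }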
    

    Let’s try it out:

    pl_fit <- pl_xgboost(d, cancer ~ ., nrounds = 5, objective = "reg:logistic")
    #> [1]  train-rmse:0.372184 
    #> [2]  train-rmse:0.288560 
    #> [3]  train-rmse:0.230171 
    #> [4]  train-rmse:0.188965 
    #> [5]  train-rmse:0.158858
    
    # Examine accuracy
    pl_predicted <- as.numeric(predict(pl_fit, as.matrix(select(d, -cancer))) >= .5)
    mean(pl_predicted == y)
    #> [1] 0.9838946
    

    Perfect!
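
    One thing to watch when parsing formulas by hand: calling as.character() on a multi-term right-hand side returns the operator as well as the variable names, whereas all.vars() returns only the names. A small base R illustration, using `f[[3]]` as the base R way to pull out the right-hand side (the formula here is just an example built from this data set's column names):

```r
# Example formula with two predictors (illustration only)
f <- cancer ~ nuclei + mitoses

# Right-hand side as an unevaluated call
rhs <- f[[3]]

as.character(rhs)   # includes the "+" operator alongside the names
all.vars(rhs)       # variable names only

# The `.` shorthand survives as a single name, so it can still be detected
all.vars(quote(.))
```

    This matters as soon as you want to grid search over formulas other than `cancer ~ .`.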

     Bringing it all together

    We can now use pipelearner and pl_xgboost() for easy grid searching:

    pl <- pipelearner(d, pl_xgboost, cancer ~ .,
                      nrounds = c(5, 10, 25),
                      eta = c(.1, .3),
                      max_depth = c(4, 6))
    
    fits <- pl %>% learn()
    #> [1]  train-rmse:0.453832 
    #> [2]  train-rmse:0.412548 
    #> ...
    
    fits
    #> # A tibble: 12 × 9
    #>    models.id cv_pairs.id train_p               fit target      model
    #>        <chr>       <chr>   <dbl>            <list>  <chr>      <chr>
    #> 1          1           1       1 <S3: xgb.Booster> cancer pl_xgboost
    #> 2         10           1       1 <S3: xgb.Booster> cancer pl_xgboost
    #> 3         11           1       1 <S3: xgb.Booster> cancer pl_xgboost
    #> 4         12           1       1 <S3: xgb.Booster> cancer pl_xgboost
    #> 5          2           1       1 <S3: xgb.Booster> cancer pl_xgboost
    #> 6          3           1       1 <S3: xgb.Booster> cancer pl_xgboost
    #> 7          4           1       1 <S3: xgb.Booster> cancer pl_xgboost
    #> 8          5           1       1 <S3: xgb.Booster> cancer pl_xgboost
    #> 9          6           1       1 <S3: xgb.Booster> cancer pl_xgboost
    #> 10         7           1       1 <S3: xgb.Booster> cancer pl_xgboost
    #> 11         8           1       1 <S3: xgb.Booster> cancer pl_xgboost
    #> 12         9           1       1 <S3: xgb.Booster> cancer pl_xgboost
    #> # ... with 3 more variables: params <list>, train <list>, test <list>
    

    Looks like all the models learned OK. Let’s write a custom function to extract model accuracy and examine the results:

    accuracy <- function(fit, data, target_var) {
      # Convert resample object to data frame
      data <- as.data.frame(data)
      # Get feature matrix and labels
      X <- data %>%
        select(-all_of(target_var)) %>%  # drop exactly the target column (matches() would use a regex)
        as.matrix()
      y <- data[[target_var]]
      # Obtain predicted class
      y_hat <- as.numeric(predict(fit, X) > .5)
      # Return accuracy
      mean(y_hat == y)
    }
    
    results <- fits %>% 
      mutate(
        # hyperparameters
        nrounds   = map_dbl(params, "nrounds"),
        eta       = map_dbl(params, "eta"),
        max_depth = map_dbl(params, "max_depth"),
        # Accuracy
        accuracy_train = pmap_dbl(list(fit, train, target), accuracy),
        accuracy_test  = pmap_dbl(list(fit, test,  target), accuracy)
      ) %>% 
      # Select columns and order rows
      select(nrounds, eta, max_depth, contains("accuracy")) %>% 
      arrange(desc(accuracy_test), desc(accuracy_train))
    
    results
    #> # A tibble: 12 × 5
    #>    nrounds   eta max_depth accuracy_train accuracy_test
    #>      <dbl> <dbl>     <dbl>          <dbl>         <dbl>
    #> 1       25   0.3         6      1.0000000     0.9489051
    #> 2       25   0.3         4      1.0000000     0.9489051
    #> 3       10   0.3         6      0.9981685     0.9489051
    #> 4        5   0.3         6      0.9945055     0.9489051
    #> 5       10   0.1         6      0.9945055     0.9489051
    #> 6       25   0.1         6      0.9945055     0.9489051
    #> 7        5   0.1         6      0.9926740     0.9489051
    #> 8       25   0.1         4      0.9890110     0.9489051
    #> 9       10   0.3         4      0.9871795     0.9489051
    #> 10       5   0.3         4      0.9853480     0.9489051
    #> 11      10   0.1         4      0.9853480     0.9416058
    #> 12       5   0.1         4      0.9835165     0.9416058
    

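    The top row, with nrounds = 25, eta = 0.3, and max_depth = 6, scored 94.89% on the test set. Note, though, that ten of the twelve models tied on test accuracy, so this grid does little to separate them.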

    Either way, the trick was the wrapper function pl_xgboost() that let us bridge xgboost and pipelearner. Note that this same principle can be used for any other machine learning functions that don’t play nice with pipelearner.
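
    As a rough sketch of that principle, here is a small wrapper factory that turns any fitting function with a matrix-and-vector interface into one that takes data and formula. The name as_formula_learner() is hypothetical (it is not part of pipelearner), and it parses the formula with base R's model.frame() and model.matrix() rather than lazyeval, which is a different technique from pl_xgboost() above. It is demonstrated on stats::lm.fit(), a base R function that takes a matrix and a vector.

```r
# Hypothetical helper: wrap a fitting function with an (x, y) interface
# so that it accepts (data, formula) instead.
as_formula_learner <- function(fit_fn) {
  function(data, formula, ...) {
    data <- as.data.frame(data)
    mf <- model.frame(formula, data = data)    # resolves `.` against the data
    y  <- model.response(mf)                   # left-hand side of the formula
    X  <- model.matrix(attr(mf, "terms"), mf)  # design matrix, incl. intercept
    X  <- X[, colnames(X) != "(Intercept)", drop = FALSE]  # drop the intercept
    fit_fn(X, y, ...)
  }
}

# Example: stats::lm.fit() takes a matrix and a vector, not a formula
pl_lm_fit <- as_formula_learner(stats::lm.fit)

# Toy data, for illustration only
toy <- data.frame(x = c(1, 2, 3, 4), y = c(2, 4, 6, 8))
fit <- pl_lm_fit(toy, y ~ x)
fit$coefficients  # slope-only fit, since no intercept column was passed
```

    Two caveats with this approach: model.frame() drops rows with missing values by default, and model.matrix() dummy-codes factors, which is not the same as the integer coding data.matrix() produces.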

     Bonus: bootstrapped cross validation

    For those of you who are comfortable, below is a bonus example of using 100 bootstrapped cross validation samples to examine consistency in the accuracy. It doesn’t get much easier than using pipelearner!

    results <- pipelearner(d, pl_xgboost, cancer ~ ., nrounds = 25) %>% 
      learn_cvpairs(n = 100) %>% 
      learn() %>% 
      mutate(
        test_accuracy  = pmap_dbl(list(fit, test,  target), accuracy)
      )
    #> [1]  train-rmse:0.357471 
    #> [2]  train-rmse:0.256735 
    #> ...
    
    results %>% 
      ggplot(aes(test_accuracy)) +
        geom_histogram(bins = 30) +
        scale_x_continuous(labels = scales::percent) +
        theme_minimal() +
        labs(x = "Accuracy", y = "Number of samples",
             title = "Test accuracy distribution for\n100 bootstrapped samples")
    

    [Figure: histogram of test accuracy across the 100 bootstrapped samples]
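
    Alongside the histogram, a few summary numbers can help quantify the consistency. The helper below is a hypothetical sketch (summarise_accuracy() is not part of pipelearner), and the accuracy values are placeholders rather than results from the models above. In practice you would pass results$test_accuracy.

```r
# Hypothetical helper: summarise a vector of per-sample test accuracies
# with the mean, standard deviation, and a percentile interval.
summarise_accuracy <- function(acc, probs = c(0.025, 0.975)) {
  stopifnot(is.numeric(acc), length(acc) > 1)
  c(mean  = mean(acc),
    sd    = sd(acc),
    lower = unname(quantile(acc, probs[1])),
    upper = unname(quantile(acc, probs[2])))
}

# Placeholder values for illustration only
acc <- c(0.94, 0.95, 0.96, 0.97, 0.98)
summarise_accuracy(acc)
```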

     Sign off

    Thanks for reading and I hope this was useful for you.

    For updates of recent blog posts, follow @drsimonj on Twitter, or email me at drsimonjackson@gmail.com to get in touch.

    If you’d like the code that produced this blog, check out the blogR GitHub repository.

    Reposted from: https://drsimonj.svbtle.com/with-our-powers-combined-xgboost-and-pipelearner

  • Original address: https://www.cnblogs.com/payton/p/6375101.html