  • With our powers combined! xgboost and pipelearner

    @drsimonj here to show you how to use xgboost (extreme gradient boosting) models in pipelearner.

     Why a post on xgboost and pipelearner?

    xgboost is one of the most powerful machine-learning libraries, so there’s a good reason to use it. pipelearner helps to create machine-learning pipelines that make it easy to do cross-validation, hyperparameter grid searching, and more. So bringing them together will make for an awesome combination!

    The only problem - out of the box, xgboost doesn’t play nice with pipelearner. Let’s work out how to deal with this.


    To follow this post you’ll need the following packages:

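    # Install (if necessary)
    install.packages(c("xgboost", "tidyverse", "devtools"))
    devtools::install_github("drsimonj/pipelearner")
    # Attach
    library(tidyverse); library(xgboost)
    library(pipelearner); library(lazyeval)  # lazyeval provides f_lhs()/f_rhs()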

    Our example will be to try to predict whether tumours are cancerous or not using the Breast Cancer Wisconsin (Diagnostic) Data Set. Set up as follows:

    data_url <- 'https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data'
    d <- read_csv(data_url,
      col_names = c('id', 'thinkness', 'size_uniformity',
                    'shape_uniformity', 'adhesion', 'epith_size',
                    'nuclei', 'chromatin', 'nucleoli', 'mitoses', 'cancer')) %>% 
      select(-id) %>%            # Remove id; not useful here
      filter(nuclei != '?') %>%  # Remove records with missing data
      mutate(cancer = cancer == 4) %>% # binary-encode 'cancer' as 1=malignant;0=benign
      mutate_all(as.numeric)     # All to numeric; needed for XGBoost
    #> # A tibble: 683 × 10
    #>    thinkness size_uniformity shape_uniformity adhesion epith_size nuclei
    #>        <dbl>           <dbl>            <dbl>    <dbl>      <dbl>  <dbl>
    #> 1          5               1                1        1          2      1
    #> 2          5               4                4        5          7     10
    #> 3          3               1                1        1          2      2
    #> 4          6               8                8        1          3      4
    #> 5          4               1                1        3          2      1
    #> 6          8              10               10        8          7     10
    #> 7          1               1                1        1          2     10
    #> 8          2               1                2        1          2      1
    #> 9          2               1                1        1          2      1
    #> 10         4               2                1        1          2      1
    #> # ... with 673 more rows, and 4 more variables: chromatin <dbl>,
    #> #   nucleoli <dbl>, mitoses <dbl>, cancer <dbl>
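
    To see what those cleaning steps do, here’s a minimal base-R sketch on a made-up three-row data frame (toy is hypothetical, not part of the real data). It assumes the raw file codes benign tumours as 2 and malignant ones as 4, and marks missing values with '?':

```r
# Hypothetical toy data mimicking two columns of the raw file
toy <- data.frame(nuclei = c("1", "?", "10"),
                  cancer = c(2, 4, 4),
                  stringsAsFactors = FALSE)

toy <- toy[toy$nuclei != "?", ]            # drop records with missing data
toy$cancer <- as.numeric(toy$cancer == 4)  # TRUE/FALSE -> 1 (malignant) / 0 (benign)
toy$nuclei <- as.numeric(toy$nuclei)       # character -> numeric, as xgboost needs
toy
```

    The '?' row is dropped, and both remaining columns end up numeric.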


    pipelearner makes it easy to do lots of routine machine learning tasks, many of which you can check out in this post. For this example, we’ll use pipelearner to perform a grid search of some xgboost hyperparameters.

    Grid searching is easy with pipelearner. For detailed instructions, check out my previous post: tidy grid search with pipelearner. As a quick reminder, we declare a data frame, machine learning function, formula, and hyperparameters as vectors. Here’s an example that would grid search multiple values of minsplit and maxdepth for an rpart decision tree:

    pipelearner(d, rpart::rpart, cancer ~ .,
                minsplit = c(2, 4, 6, 8, 10),
                maxdepth = c(2, 3, 4, 5))

    The challenge for xgboost:

    pipelearner expects a model function that has two arguments: data and formula


    Here’s an xgboost model:

    # Prep data (X) and labels (y)
    X <- select(d, -cancer) %>% as.matrix()
    y <- d$cancer
    # Fit the model
    fit <- xgboost(X, y, nrounds = 5, objective = "reg:logistic")
    #> [1]  train-rmse:0.372184 
    #> [2]  train-rmse:0.288560 
    #> [3]  train-rmse:0.230171 
    #> [4]  train-rmse:0.188965 
    #> [5]  train-rmse:0.158858
    # Examine accuracy
    predicted <- as.numeric(predict(fit, X) >= .5)
    mean(predicted == y)
    #> [1] 0.9838946

    Looks like we have a model with 98.39% accuracy on the training data!
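
    If the thresholding step is unclear, this small sketch may help. The probabilities and labels are made up for illustration; the logic mirrors the accuracy check in the code above: call anything at or above .5 a positive, then take the proportion of predictions that match the labels.

```r
# Made-up predicted probabilities and true labels (illustration only)
prob   <- c(0.10, 0.70, 0.50, 0.40)
y_true <- c(0, 1, 0, 0)

y_hat <- as.numeric(prob >= .5)  # threshold at .5
mean(y_hat == y_true)            # proportion of correct predictions
#> [1] 0.75

# A confusion table shows which kind of error was made
table(predicted = y_hat, actual = y_true)
```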

    Regardless, notice that the first two arguments to xgboost() are a numeric data matrix and a numeric label vector. This is not what pipelearner wants!

     Wrapper function to parse data and formula

    To make xgboost compatible with pipelearner we need to write a wrapper function that accepts data and formula, and uses these to pass a feature matrix and label vector to xgboost:

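    pl_xgboost <- function(data, formula, ...) {
      # pipelearner passes resample objects, so coerce to a data frame first
      data <- as.data.frame(data)
      # Feature and label names from the formula (f_rhs()/f_lhs() are from lazyeval)
      X_names <- all.vars(lazyeval::f_rhs(formula))
      y_name  <- all.vars(lazyeval::f_lhs(formula))
      # A right-hand side of '.' means "every column except the label"
      if (identical(X_names, '.')) {
        X_names <- names(data)[names(data) != y_name]
      }
      X <- data.matrix(data[, X_names, drop = FALSE])
      y <- data[[y_name]]
      xgboost(data = X, label = y, ...)
    }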

    Let’s try it out:

    pl_fit <- pl_xgboost(d, cancer ~ ., nrounds = 5, objective = "reg:logistic")
    #> [1]  train-rmse:0.372184 
    #> [2]  train-rmse:0.288560 
    #> [3]  train-rmse:0.230171 
    #> [4]  train-rmse:0.188965 
    #> [5]  train-rmse:0.158858
    # Examine accuracy
    pl_predicted <- as.numeric(predict(pl_fit, as.matrix(select(d, -cancer))) >= .5)
    mean(pl_predicted == y)
    #> [1] 0.9838946


     Bringing it all together

    We can now use pipelearner and pl_xgboost() for easy grid searching:

    pl <- pipelearner(d, pl_xgboost, cancer ~ .,
                      nrounds = c(5, 10, 25),
                      eta = c(.1, .3),
                      max_depth = c(4, 6))
    fits <- pl %>% learn()
    #> [1]  train-rmse:0.453832 
    #> [2]  train-rmse:0.412548 
    #> ...
    #> # A tibble: 12 × 9
    #>    models.id cv_pairs.id train_p               fit target      model
    #>        <chr>       <chr>   <dbl>            <list>  <chr>      <chr>
    #> 1          1           1       1 <S3: xgb.Booster> cancer pl_xgboost
    #> 2         10           1       1 <S3: xgb.Booster> cancer pl_xgboost
    #> 3         11           1       1 <S3: xgb.Booster> cancer pl_xgboost
    #> 4         12           1       1 <S3: xgb.Booster> cancer pl_xgboost
    #> 5          2           1       1 <S3: xgb.Booster> cancer pl_xgboost
    #> 6          3           1       1 <S3: xgb.Booster> cancer pl_xgboost
    #> 7          4           1       1 <S3: xgb.Booster> cancer pl_xgboost
    #> 8          5           1       1 <S3: xgb.Booster> cancer pl_xgboost
    #> 9          6           1       1 <S3: xgb.Booster> cancer pl_xgboost
    #> 10         7           1       1 <S3: xgb.Booster> cancer pl_xgboost
    #> 11         8           1       1 <S3: xgb.Booster> cancer pl_xgboost
    #> 12         9           1       1 <S3: xgb.Booster> cancer pl_xgboost
    #> # ... with 3 more variables: params <list>, train <list>, test <list>
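
    The 12 rows match one model per combination of the supplied hyperparameter values. Here’s a quick base-R sketch of that grid, using expand.grid() purely to count the combinations (this is an illustration, not necessarily how pipelearner builds them internally):

```r
# Every combination of the hyperparameter values passed to pipelearner()
grid <- expand.grid(nrounds   = c(5, 10, 25),
                    eta       = c(.1, .3),
                    max_depth = c(4, 6))
nrow(grid)  # 3 * 2 * 2 combinations
#> [1] 12
```

    Adding one more value to any of the vectors multiplies the number of models to fit, so grids grow quickly.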

    Looks like all the models learned OK. Let’s write a custom function to extract model accuracy and examine the results:

    accuracy <- function(fit, data, target_var) {
      # Convert resample object to data frame
      data <- as.data.frame(data)
      # Get feature matrix and labels
      X <- data %>%
        select(-matches(target_var)) %>% 
        as.matrix()
      y <- data[[target_var]]
      # Obtain predicted class
      y_hat <- as.numeric(predict(fit, X) > .5)
      # Return accuracy
      mean(y_hat == y)
    }
    results <- fits %>% 
      mutate(
        # hyperparameters
        nrounds   = map_dbl(params, "nrounds"),
        eta       = map_dbl(params, "eta"),
        max_depth = map_dbl(params, "max_depth"),
        # Accuracy
        accuracy_train = pmap_dbl(list(fit, train, target), accuracy),
        accuracy_test  = pmap_dbl(list(fit, test,  target), accuracy)
      ) %>% 
      # Select columns and order rows
      select(nrounds, eta, max_depth, contains("accuracy")) %>% 
      arrange(desc(accuracy_test), desc(accuracy_train))
    #> # A tibble: 12 × 5
    #>    nrounds   eta max_depth accuracy_train accuracy_test
    #>      <dbl> <dbl>     <dbl>          <dbl>         <dbl>
    #> 1       25   0.3         6      1.0000000     0.9489051
    #> 2       25   0.3         4      1.0000000     0.9489051
    #> 3       10   0.3         6      0.9981685     0.9489051
    #> 4        5   0.3         6      0.9945055     0.9489051
    #> 5       10   0.1         6      0.9945055     0.9489051
    #> 6       25   0.1         6      0.9945055     0.9489051
    #> 7        5   0.1         6      0.9926740     0.9489051
    #> 8       25   0.1         4      0.9890110     0.9489051
    #> 9       10   0.3         4      0.9871795     0.9489051
    #> 10       5   0.3         4      0.9853480     0.9489051
    #> 11      10   0.1         4      0.9853480     0.9416058
    #> 12       5   0.1         4      0.9835165     0.9416058

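    Our top model, which got 94.89% on a test set, had nrounds = 25, eta = 0.3, and max_depth = 6. Note, though, that ten of the twelve models tied at this test accuracy, so the ordering among them reflects training accuracy rather than better generalisation.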

    Either way, the trick was the wrapper function pl_xgboost() that let us bridge xgboost and pipelearner. Note that this same principle can be used for any other machine learning functions that don’t play nice with pipelearner.
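
    As a sketch of that principle, here’s the same wrapper pattern applied to stats::lm.fit(), another function that wants a matrix and a vector rather than data and formula. The name pl_lm_fit() is hypothetical, and this version uses only base R to split the formula. I haven’t run it inside pipelearner, so treat it as an illustration of the pattern rather than a tested recipe:

```r
# Hypothetical wrapper: exposes a (data, formula) interface for stats::lm.fit()
pl_lm_fit <- function(data, formula, ...) {
  data <- as.data.frame(data)
  y_name  <- all.vars(formula[[2]])          # left-hand side: the target
  X_names <- all.vars(formula[[3]])          # right-hand side: the features
  if (identical(X_names, ".")) {
    X_names <- setdiff(names(data), y_name)  # "." means every other column
  }
  # lm.fit() does not add an intercept, so bind a column of 1s ourselves
  X <- cbind(intercept = 1, data.matrix(data[, X_names, drop = FALSE]))
  y <- data[[y_name]]
  stats::lm.fit(x = X, y = y, ...)
}

# Compare against the usual formula interface on a built-in data set
fit_wrapped <- pl_lm_fit(mtcars, mpg ~ wt + hp)
fit_formula <- lm(mpg ~ wt + hp, data = mtcars)
all.equal(unname(fit_wrapped$coefficients), unname(coef(fit_formula)))
```

    One caveat: lm.fit() returns a plain list, so predict() won’t work on it the way it does for an xgb.Booster. Any function you wrap needs its own way of producing predictions.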

     Bonus: bootstrapped cross validation

    For those of you who are comfortable, below is a bonus example of using 100 bootstrapped cross validation samples to examine consistency in the accuracy. It doesn’t get much easier than using pipelearner!

    results <- pipelearner(d, pl_xgboost, cancer ~ ., nrounds = 25) %>% 
      learn_cvpairs(n = 100) %>% 
      learn() %>% 
      mutate(
        test_accuracy  = pmap_dbl(list(fit, test,  target), accuracy)
      )
    #> [1]  train-rmse:0.357471 
    #> [2]  train-rmse:0.256735 
    #> ...
    results %>% 
      ggplot(aes(test_accuracy)) +
        geom_histogram(bins = 30) +
        scale_x_continuous(labels = scales::percent) +
        theme_minimal() +
        labs(x = "Accuracy", y = "Number of samples",
             title = "Test accuracy distribution for\n100 bootstrapped samples")
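
    A histogram is easier to report alongside a few numbers. Below is a minimal sketch of a helper that summarises a vector of accuracies; summarise_accuracy() is a hypothetical name, and the five values are placeholders rather than real results. With the pipeline above you would pass results$test_accuracy instead:

```r
# Hypothetical helper: mean, SD and a central 95% interval of accuracies
summarise_accuracy <- function(acc) {
  c(mean = mean(acc),
    sd   = sd(acc),
    quantile(acc, probs = c(.025, .975)))
}

# Placeholder values for illustration only (not output from the models above)
placeholder_acc <- c(0.93, 0.95, 0.96, 0.94, 0.97)
acc_summary <- summarise_accuracy(placeholder_acc)
acc_summary
```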


     Sign off

    Thanks for reading and I hope this was useful for you.

    For updates of recent blog posts, follow @drsimonj on Twitter, or email me at drsimonjackson@gmail.com to get in touch.

    If you’d like the code that produced this blog, check out the blogR GitHub repository.


