Jared Wilber | 11 March, 2019

Keep It Together

Nested Workflows for Machine Learning with R.

Thanks to recent-ish developments in the tidyverse (namely the tidyr and purrr packages), one can easily create effective machine-learning workflows via nested data structures.

To present the idea, we'll work through a common machine-learning workflow: experimenting with different data preprocessing + machine learning model combinations. Such a workflow (discussed further in [5] & [6]) comprises multiple steps, each of which outputs a new data object, so it lends itself well to nesting multiple objects into the same dataframe and serves as a nice guide for nesting data in R.

This idea is not my own. Rather, what I'm presenting is a combination of things I've read online, in books, and from my own experimentation. Definitely give the References & Related Work section at the bottom a look to learn about cool related work in R.


Tidy Data

The foundation of our method relies on the concept of tidy data [reference]. Recall that tidy data refers to a consistent data structure in which:

- Each variable forms a column.
- Each observation forms a row.
- Each value has its own cell.
This is probably how you structure your data already, so formalizing it may seem like overkill. However, structuring our data in a tidy format not only provides an intuitive mental model for analysis and keeps both code *and* data objects organized, it also lets us work seamlessly with tidyverse functions [1]. Our nested dataframe will itself be tidy: each row will hold one complete unit of our analysis, and each column one of its attributes.



As a running example, we'll look at data from the US McDonald's menu to see if we can predict an item's category (e.g. `Breakfast`, `Desserts`, `Snacks & Sides`) from its nutritional content (e.g. `Calories`, `Total Fat`, `Sodium`) [2]:


# load libraries
library(caret)
library(tidyverse)

# load data, removing non-nutritional columns
menu <- read_csv('menu.csv') %>%
  select(-Item, -`Serving Size`) 
menu

# A tibble: 260 x 22
   Category Calories `Calories from ~ `Total Fat` `Total Fat (% D~ `Saturated Fat` `Saturated Fat ~ `Trans Fat` Cholesterol
   <chr>       <dbl>            <dbl>       <dbl>            <dbl>           <dbl>            <dbl>       <dbl>       <dbl>
 1 Breakfa~      300              120          13               20               5               25           0         260
 2 Breakfa~      250               70           8               12               3               15           0          25
 3 Breakfa~      370              200          23               35               8               42           0          45
 4 Breakfa~      450              250          28               43              10               52           0         285
 5 Breakfa~      400              210          23               35               8               42           0          50
 6 Breakfa~      430              210          23               36               9               46           1         300
 7 Breakfa~      460              230          26               40              13               65           0         250
 8 Breakfa~      520              270          30               47              14               68           0         250
 9 Breakfa~      410              180          20               32              11               56           0          35
10 Breakfa~      470              220          25               38              12               59           0          35
# ... with 250 more rows, and 13 more variables: `Cholesterol (% Daily Value)` <dbl>, Sodium <dbl>, `Sodium (% Daily
#   Value)` <dbl>, Carbohydrates <dbl>, `Carbohydrates (% Daily Value)` <dbl>, `Dietary Fiber` <dbl>, `Dietary Fiber (% Daily
#   Value)` <dbl>, Sugars <dbl>, Protein <dbl>, `Vitamin A (% Daily Value)` <dbl>, `Vitamin C (% Daily Value)` <dbl>, `Calcium
#   (% Daily Value)` <dbl>, `Iron (% Daily Value)` <dbl>

Throughout our analysis, we'll be making heavy use of a (relatively) new tidy data structure: the list-column. Whereas typical dataframes are populated with atomic vectors, list-columns allow us to nest any R object inside our dataframe. We'll use list-columns to nest entire dataframes, model objects, and other data types into our dataframe.
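If list-columns are new to you, here's a minimal illustration (the column names are just for show) of a tibble holding a dataframe and a model object side by side:

# a list-column can hold arbitrary R objects, one per row
tibble(
  name = c('a_dataframe', 'a_model'),
  obj  = list(head(mtcars), lm(mpg ~ wt, data = mtcars))
)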

To begin, we'll use tibble::enframe to nest our data with an accompanying index column, replicating the menu four times so that each copy can later receive its own transformation:


# nest original data
menu <- rep(list(menu), 4) %>% 
  enframe(name = 'index', value = 'data')
menu

# A tibble: 4 x 2
  index data               
  <int> <list>             
1     1 <tibble [260 x 22]>
2     2 <tibble [260 x 22]>
3     3 <tibble [260 x 22]>
4     4 <tibble [260 x 22]>


Feature Transformation

In order to work with nested data, we need a way to apply a function to each nested element. We achieve this in R via purrr, a library that exposes functional tools for iteration. Our workflow for using purrr is straightforward. First, define a function to be applied to a single dataframe. Then, use purrr's `map` suite of functions to apply that function over every row of the dataframe. The function is applied inside a dplyr::mutate call, nesting its output via a list-column.
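Schematically, the pattern looks like this, with nrow standing in for any function of a single dataframe (the toy tibble here is purely illustrative):

# the mutate + map pattern on a toy list-column
tibble(data = list(mtcars, iris)) %>%
  mutate(n_rows = map_int(data, nrow))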

For presentation purposes, we'll create our own function for feature transformation. The function applies a basic power transformation [3] to all numeric features in our dataframe:


# apply a given power transform to all numeric columns, dropping the label
power_transform <- function(df, pow) {
  df %>%
    mutate_if(is.numeric, ~ .x ^ pow) %>%
    select(-Category)
}
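As a quick sanity check, we can apply the function by hand to one of the tibbles nested above:

# square-root transform of the first nested copy of the menu
power_transform(menu$data[[1]], 0.5)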

Then, we carry out the purrr::map + dplyr::mutate workflow described above, furthering the development of our nested dataframe:


# add 3 columns to data: power, trns_data, and label
menu <- menu %>% 
  mutate(
    power = c(0.5, 1, 2, 3),
    trns_data = purrr::map2(data, power, power_transform),
    label = purrr::map(data, 'Category')
  )
menu

  # A tibble: 4 x 5
  index data                power trns_data           label      
  <int> <list>              <dbl> <list>              <list>     
1     1 <tibble [260 x 22]>   0.5 <tibble [260 x 21]> <chr [260]>
2     2 <tibble [260 x 22]>   1   <tibble [260 x 21]> <chr [260]>
3     3 <tibble [260 x 22]>   2   <tibble [260 x 21]> <chr [260]>
4     4 <tibble [260 x 22]>   3   <tibble [260 x 21]> <chr [260]>

In the above code, we create three new columns in our dataframe: power, trns_data, and label. The power column is simply an atomic vector we define ourselves. We create trns_data using purrr::map2, which lets us map a function over two arguments in parallel. Note that we're feeding columns of the dataframe itself as arguments to the function; this ability to apply the same function across rows of related data is what makes the nested workflow so powerful. The label column is created by taking advantage of purrr's name-based selection: passing a string to map() extracts the element with that name from each list item.
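If the name-based selection is unfamiliar, here's a toy illustration:

# passing a string to map() extracts that element by name from each list item
purrr::map(list(list(a = 1, b = 2), list(a = 3, b = 4)), 'a')  # list(1, 3)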


Machine Learning

We create list-columns for our machine-learning algorithms in a similar manner: combining functions, purrr, and dplyr::mutate.

To create our models, we'll use caret, a package that wraps different machine learning implementations in R behind a uniform API. Rather than creating a separate function for each model, I'll write a function factory [reference] to produce the functions for us. Thanks to the consistent syntax provided by caret, this is a breeze:


# machine learning model function factory
mlFuncFact <- function(ml_method) {
  function(data, label) {
    caret::train(
      x = data,
      y = label,
      method = ml_method
    )
  }
}
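To see what the factory hands back, we can create a single training function and inspect it (knn_trainer is just an illustrative name):

# each call to the factory returns a new function that closes over ml_method
knn_trainer <- mlFuncFact('knn')
knn_trainer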


Using this function factory, we can create a list of our desired algorithms with ease. Because we want to enumerate each model + data-preprocessing combination, we'll first store our models in a separate data structure and then combine them with our preprocessed data:


model_list <- list(
  decision_tree = mlFuncFact('rpart2'),
  random_forest = mlFuncFact('ranger'),
  boosted_log_reg = mlFuncFact('LogitBoost'),
  knn = mlFuncFact('knn'),
  svm = mlFuncFact('svmLinear3'),
  lda = mlFuncFact('lda')
) %>%
  enframe(name = 'model', value = 'model_func')
model_list

  # A tibble: 6 x 2
  model           model_func
  <chr>           <list>    
1 decision_tree   <fn>      
2 random_forest   <fn>      
3 boosted_log_reg <fn>      
4 knn             <fn>      
5 svm             <fn>      
6 lda             <fn>

Recall that our goal is to run each model + data-transformation combination. We can do this by taking the Cartesian product [add link] (also known as a cross-join) of our two dataframes. tidyr provides this functionality via the crossing() function.
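If crossing() is new to you, here's a toy cross-join with made-up columns x and y:

# every row of the first tibble paired with every row of the second (4 rows)
crossing(tibble(x = 1:2), tibble(y = c('a', 'b')))

Applying it to our own two dataframes: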


# cross-join original data with models
menu <- menu %>% 
  crossing(model_list) %>% 
  arrange(model, power)
menu

  # A tibble: 24 x 7
   index data                power trns_data           label       model           model_func
   <int> <list>              <dbl> <list>              <list>      <chr>           <list>    
 1     1 <tibble [260 x 22]>   0.5 <tibble [260 x 21]> <chr [260]> boosted_log_reg <fn>      
 2     2 <tibble [260 x 22]>   1   <tibble [260 x 21]> <chr [260]> boosted_log_reg <fn>      
 3     3 <tibble [260 x 22]>   2   <tibble [260 x 21]> <chr [260]> boosted_log_reg <fn>      
 4     4 <tibble [260 x 22]>   3   <tibble [260 x 21]> <chr [260]> boosted_log_reg <fn>      
 5     1 <tibble [260 x 22]>   0.5 <tibble [260 x 21]> <chr [260]> decision_tree   <fn>      
 6     2 <tibble [260 x 22]>   1   <tibble [260 x 21]> <chr [260]> decision_tree   <fn>      
 7     3 <tibble [260 x 22]>   2   <tibble [260 x 21]> <chr [260]> decision_tree   <fn>      
 8     4 <tibble [260 x 22]>   3   <tibble [260 x 21]> <chr [260]> decision_tree   <fn>      
 9     1 <tibble [260 x 22]>   0.5 <tibble [260 x 21]> <chr [260]> knn             <fn>      
10     2 <tibble [260 x 22]>   1   <tibble [260 x 21]> <chr [260]> knn             <fn>      
# ... with 14 more rows

The output is a dataframe whose rows enumerate every possible data-preprocessing + machine learning combination, with all the relevant objects nested alongside.

All Together Now

Now that we've created all unique model + data-transformation combinations, we can fit the models. To do so, we'll once again take advantage of purrr's map capabilities. In this case, we'd like to create a new nested column by invoking a function for each row, so we'll use invoke_map.
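Conceptually, invoke_map pairs a list of functions with a parallel list of argument lists; a tiny sketch:

# call each function with its matching arguments: returns list(6, 10)
purrr::invoke_map(list(sum, prod), list(list(1, 2, 3), list(2, 5)))

Applied to our nested dataframe: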


# evaluate models
menu <- menu %>% 
  mutate(
    params = map2(trns_data, label, ~ list(data = .x, label = .y)),
    model_fit = invoke_map(model_func, params)
  )
menu

  # A tibble: 24 x 9
   index data                power trns_data           label       model           model_func params     model_fit  
   <int> <list>              <dbl> <list>              <list>      <chr>           <list>     <list>     <list>     
 1     1 <tibble [260 x 22]>   0.5 <tibble [260 x 21]> <chr [260]> boosted_log_reg <fn>       <list [2]> <S3: train>
 2     2 <tibble [260 x 22]>   1   <tibble [260 x 21]> <chr [260]> boosted_log_reg <fn>       <list [2]> <S3: train>
 3     3 <tibble [260 x 22]>   2   <tibble [260 x 21]> <chr [260]> boosted_log_reg <fn>       <list [2]> <S3: train>
 4     4 <tibble [260 x 22]>   3   <tibble [260 x 21]> <chr [260]> boosted_log_reg <fn>       <list [2]> <S3: train>
 5     1 <tibble [260 x 22]>   0.5 <tibble [260 x 21]> <chr [260]> decision_tree   <fn>       <list [2]> <S3: train>
 6     2 <tibble [260 x 22]>   1   <tibble [260 x 21]> <chr [260]> decision_tree   <fn>       <list [2]> <S3: train>
 7     3 <tibble [260 x 22]>   2   <tibble [260 x 21]> <chr [260]> decision_tree   <fn>       <list [2]> <S3: train>
 8     4 <tibble [260 x 22]>   3   <tibble [260 x 21]> <chr [260]> decision_tree   <fn>       <list [2]> <S3: train>
 9     1 <tibble [260 x 22]>   0.5 <tibble [260 x 21]> <chr [260]> knn             <fn>       <list [2]> <S3: train>
10     2 <tibble [260 x 22]>   1   <tibble [260 x 21]> <chr [260]> knn             <fn>       <list [2]> <S3: train>
# ... with 14 more rows

Great, so now we have all our data together and our models fit. Everything we need to identify the best models is in one place; we just need to extract the results, once again taking advantage of caret's consistent API.
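Each fitted caret train object stores its resampled performance in a results dataframe, one row per candidate hyperparameter setting, with columns such as Accuracy and AccuracySD. For instance, we could peek at the first fit:

# inspect the resampling results of a single fitted model
menu$model_fit[[1]]$results

With that in hand, we map a simple extractor over our model_fit column: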


# extract results for each model
trained_models <- menu %>% 
  mutate(
    accuracy = map_dbl(model_fit, ~ max(.x$results$Accuracy)),
    accuracySD = map_dbl(model_fit, ~ max(.x$results$AccuracySD))
  ) %>% 
  arrange(desc(accuracy))

trained_models

  # A tibble: 24 x 11
   index data              power trns_data          label       model         model_func params    model_fit accuracy accuracySD
   <int> <list>            <dbl> <list>             <list>      <chr>         <list>     <list>    <list>        <dbl>      <dbl>
 1     1 <tibble [260 x 2~   0.5 <tibble [260 x 21~ <chr [260]> svm           <fn>       <list [2~ <S3: trai~    0.828     0.0545
 2     2 <tibble [260 x 2~   1   <tibble [260 x 21~ <chr [260]> random_forest <fn>       <list [2~ <S3: trai~    0.822     0.0506
 3     1 <tibble [260 x 2~   0.5 <tibble [260 x 21~ <chr [260]> boosted_log_~ <fn>       <list [2~ <S3: trai~    0.821     0.0544
 4     1 <tibble [260 x 2~   0.5 <tibble [260 x 21~ <chr [260]> random_forest <fn>       <list [2~ <S3: trai~    0.820     0.0376
 5     4 <tibble [260 x 2~   3   <tibble [260 x 21~ <chr [260]> boosted_log_~ <fn>       <list [2~ <S3: trai~    0.816     0.0534
 6     2 <tibble [260 x 2~   1   <tibble [260 x 21~ <chr [260]> boosted_log_~ <fn>       <list [2~ <S3: trai~    0.815     0.0437
 7     3 <tibble [260 x 2~   2   <tibble [260 x 21~ <chr [260]> boosted_log_~ <fn>       <list [2~ <S3: trai~    0.815     0.0490
 8     3 <tibble [260 x 2~   2   <tibble [260 x 21~ <chr [260]> random_forest <fn>       <list [2~ <S3: trai~    0.813     0.0486
 9     4 <tibble [260 x 2~   3   <tibble [260 x 21~ <chr [260]> random_forest <fn>       <list [2~ <S3: trai~    0.805     0.0393
10     2 <tibble [260 x 2~   1   <tibble [260 x 21~ <chr [260]> svm           <fn>       <list [2~ <S3: trai~    0.766     0.112 
# ... with 14 more rows

So our best model is the linear SVM (svmLinear3) fit on square-root-transformed data, with a mean accuracy of about 0.83. An added benefit of this analysis is that everything is not only organized (all related objects stored together), it's organized in a tidy format. This lets us run further analysis with tidy tools; for example, we can pipe our results straight into ggplot to compare models visually:


# plot model accuracies
trained_models %>% 
  ggplot(aes(x = as.factor(power), colour = model)) +
  geom_point(aes(y = accuracy), size = 2) +
  geom_errorbar(aes(ymin = accuracy - accuracySD,
                    ymax = accuracy + accuracySD)) +
  theme_classic() +
  facet_wrap(~model)
[Model Results Plot: accuracy ± SD for each power transform, faceted by model]
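And since the winning rows sit at the top of trained_models, pulling out the best fitted model for downstream use (prediction, say) is a one-liner:

# grab the caret train object from the top-ranked row
best_fit <- trained_models$model_fit[[1]]
best_fit$method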



References & Related Work