Jared Wilber | 11 March, 2019

Keep It Together

Using the tidyverse for machine learning.

I've recently discovered a new approach for doing data analysis in R. The approach can be summarised in a single sentence:

Nest all related things together into a single data-frame.

It's a simple idea, but a powerful one: keeping related objects together keeps both code and data organized, scales naturally to many model + data combinations, and lets us keep using tidyverse tools at every step of the analysis.

To present the idea, I'll use a common machine-learning workflow as a running example: trying out different preprocessing + model combinations. This workflow (discussed further in [5] & [6]) comprises multiple steps and lends itself well to nesting multiple objects in the same dataframe.

This idea is not my own. Rather, what I'm presenting is a combination of things I've read online and in books, plus my own experimentation. Check out the References & Related Work section at the bottom for more.


Tidy Data

As a quick recap, tidy data refers to a consistent data structure in which:

1. Each variable forms its own column.
2. Each observation forms its own row.
3. Each value has its own cell.

Structuring our data in a tidy format provides us with an intuitive mental model for analysis, keeps both code *and* data objects organized, and allows us to work with tidyverse functions. [1]

Throughout the post, we'll lean on three main tools: list-columns (for nesting arbitrary R objects inside a dataframe), purrr (for functional iteration over those columns), and caret (for a uniform interface to many machine-learning models).

As a running example, we'll look at data from the US McDonald's menu to see if we can predict an item's category (e.g. `Breakfast`, `Desserts`, `Snacks & Sides`) from its nutritional content (e.g. `Calories`, `Total Fat`, `Sodium`) [2]:


# load libraries
library(caret)
library(tidyverse)

# load data, removing non-nutritional columns
menu <- read_csv('menu.csv') %>%
  select(-Item, -`Serving Size`) 
menu

# A tibble: 260 x 22
   Category Calories `Calories from ~ `Total Fat` `Total Fat (% D~ `Saturated Fat` `Saturated Fat ~ `Trans Fat` Cholesterol
   <chr>       <dbl>            <dbl>       <dbl>            <dbl>           <dbl>            <dbl>       <dbl>       <dbl>
 1 Breakfa~      300              120          13               20               5               25           0         260
 2 Breakfa~      250               70           8               12               3               15           0          25
 3 Breakfa~      370              200          23               35               8               42           0          45
 4 Breakfa~      450              250          28               43              10               52           0         285
 5 Breakfa~      400              210          23               35               8               42           0          50
 6 Breakfa~      430              210          23               36               9               46           1         300
 7 Breakfa~      460              230          26               40              13               65           0         250
 8 Breakfa~      520              270          30               47              14               68           0         250
 9 Breakfa~      410              180          20               32              11               56           0          35
10 Breakfa~      470              220          25               38              12               59           0          35
# ... with 250 more rows, and 13 more variables: `Cholesterol (% Daily Value)` <dbl>, Sodium <dbl>, `Sodium (% Daily
#   Value)` <dbl>, Carbohydrates <dbl>, `Carbohydrates (% Daily Value)` <dbl>, `Dietary Fiber` <dbl>, `Dietary Fiber (% Daily
#   Value)` <dbl>, Sugars <dbl>, Protein <dbl>, `Vitamin A (% Daily Value)` <dbl>, `Vitamin C (% Daily Value)` <dbl>, `Calcium
#   (% Daily Value)` <dbl>, `Iron (% Daily Value)` <dbl>

Throughout our analysis, we'll be making heavy use of a (relatively) new tidy data structure: the list-column. Whereas typical dataframes are populated with atomic vectors, list-columns allow us to nest any R object inside our dataframe.
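
As a quick standalone illustration (toy values, nothing from our menu data), enframe() from the tibble package turns a list into a two-column tibble, storing the list itself in a list-column:


# enframe() converts a (possibly named) list into a two-column tibble;
# the 'value' column is a list-column holding the original elements
tibble::enframe(list(x = 1:3, y = letters[1:2]))
# returns a 2-row tibble: a 'name' column ("x", "y") and a
# 'value' list-column (<int [3]>, <chr [2]>)

To begin our own analysis, we'll use tibble::enframe() (attached with the tidyverse) to nest four copies of our data, one per power transformation we'll try shortly, alongside an accompanying index column: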


# nest 4 copies of the original data (one per power transformation)
menu <- rep(list(menu), 4) %>% 
  enframe(name = 'index', value = 'data')
menu

# A tibble: 4 x 2
  index data               
  <int> <list>             
1     1 <tibble [260 x 22]>
2     2 <tibble [260 x 22]>
3     3 <tibble [260 x 22]>
4     4 <tibble [260 x 22]>


Feature Transformation

Feature transformation is an important step in any machine-learning workflow. purrr exposes functional tools for iteration in R, and we'll make heavy use of them to grow our dataframe and aid in our analysis. For illustration, we'll write our own feature-transformation function, which applies a basic power transformation[3] to all numeric features in our dataframe:


# function to apply a given power-transform to numerical columns
power_transform <- function(df, pow) {
  df %>%
    mutate_if(is.numeric, ~ .x ^ pow) %>%
    select(-Category)  # drop the label; we'll keep it in its own list-column
}
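
As a quick, hedged spot check (not part of the workflow itself), we can call the function directly on one of the nested copies of the data before mapping it over all of them:


# square every numeric column of the first nested copy;
# the result should be a 260 x 21 tibble (Category is dropped)
power_transform(menu$data[[1]], pow = 2)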

Once we've created our UDF, we can use purrr, R's functional programming package, to easily create new list-columns by applying our function to the data with multiple parameters. We want to apply and keep track of a number of different power transformations, so we'll add three new columns:

- `power`: the exponent used for each power transformation.
- `trns_data`: the data with that power transformation applied.
- `label`: our response variable, `Category`, pulled out of the original data.


# add 3 columns to data: power, trns_data, and label
menu <- menu %>% 
  mutate(
    power = c(0.5, 1, 2, 3),
    trns_data = purrr::map2(data, power, ~ power_transform(.x, .y)),
    label = purrr::map(data, 'Category')
  )
menu

# A tibble: 4 x 5
  index data                power trns_data           label      
  <int> <list>              <dbl> <list>              <list>     
1     1 <tibble [260 x 22]>   0.5 <tibble [260 x 21]> <chr [260]>
2     2 <tibble [260 x 22]>   1   <tibble [260 x 21]> <chr [260]>
3     3 <tibble [260 x 22]>   2   <tibble [260 x 21]> <chr [260]>
4     4 <tibble [260 x 22]>   3   <tibble [260 x 21]> <chr [260]>

Our dataframe now has three additional columns: `power`, `trns_data`, and `label`. `trns_data` uses purrr::map2 to apply our user-defined function row-wise to two arguments: the `data` column and the `power` column. `label` uses purrr's handy name-shortcut[4] functionality to grab our response variable, `Category`, from each nested dataframe.
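
Both purrr idioms are easier to see on toy inputs; here's a minimal sketch that doesn't depend on our menu data:


# map2() iterates over two inputs in parallel, calling the function once per pair
purrr::map2(list(1:3, 4:6), c(10, 100), ~ .x * .y)
# returns list(c(10, 20, 30), c(400, 500, 600))

# supplying a string instead of a function extracts that named element from each
# list element; this is how `label` pulls `Category` out of each nested tibble
purrr::map(list(list(a = 1, b = 2), list(a = 3, b = 4)), 'a')
# returns list(1, 3)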


Machine Learning

We'll add list-columns for our machine-learning algorithms along the same lines as before, using UDFs and purrr. To create the models, we'll use caret, a package that wraps many different machine-learning packages in R behind a uniform API. Taking advantage of caret's consistent syntax, we can write a function factory that outputs a training function for each of our desired models, and then use that factory to build a list of models with ease:


# machine learning model function factory
mlFuncFact <- function(ml_method) {
  function(data, label) {
   caret::train(
      x = data,
      y = label,
      method = ml_method
    )
  }
}

# create list of models
model_list <- list(
  decision_tree = mlFuncFact('rpart2'),
  random_forest = mlFuncFact('ranger'),
  boosted_log_reg = mlFuncFact('LogitBoost'),
  knn = mlFuncFact('knn'),
  svm = mlFuncFact('svmLinear3'),
  lda = mlFuncFact('lda')
  ) %>%
  enframe(name = 'model', value = 'model_func')
model_list

# A tibble: 6 x 2
  model           model_func
  <chr>           <list>    
1 decision_tree   <fn>      
2 random_forest   <fn>      
3 boosted_log_reg <fn>      
4 knn             <fn>      
5 svm             <fn>      
6 lda             <fn>
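
Each entry in `model_func` is just a closure around caret::train(), so we can call one directly like any other function. Here's a hedged sketch using R's built-in iris data rather than our menu (it assumes the rpart package backing the 'rpart2' method is installed):


# build a single training function from the factory and call it directly
fit_tree <- mlFuncFact('rpart2')
toy_fit <- fit_tree(data = iris[, 1:4], label = iris$Species)
toy_fit$method
# [1] "rpart2"

Because every training function shares the same (data, label) signature, we'll be able to invoke them all uniformly later on.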

Next, we combine our models and our transformed data into a single dataframe. Recall that we want to run every model + data-transformation combination. We can do this by taking the Cartesian product (also known as a cross-join) of the two tibbles. Luckily for us, tidyr provides this functionality via the crossing() function.
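
To see what crossing() does in isolation, here's a minimal sketch on toy vectors (nothing from our menu analysis):


# crossing() returns every combination of its inputs: 2 powers x 2 models = 4 rows
tidyr::crossing(power = c(0.5, 1), model = c('knn', 'svm'))
# returns a 4-row tibble: (0.5, knn), (0.5, svm), (1, knn), (1, svm)

Crossing our nested data with the model list gives one row for every model + transformation pair: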


# cross-join original data with models
menu <- menu %>% 
  crossing(model_list) %>% 
  arrange(model, power)
menu

# A tibble: 24 x 7
   index data                power trns_data           label       model           model_func
   <int> <list>              <dbl> <list>              <list>      <chr>           <list>    
 1     1 <tibble [260 x 22]>   0.5 <tibble [260 x 21]> <chr [260]> boosted_log_reg <fn>      
 2     2 <tibble [260 x 22]>   1   <tibble [260 x 21]> <chr [260]> boosted_log_reg <fn>      
 3     3 <tibble [260 x 22]>   2   <tibble [260 x 21]> <chr [260]> boosted_log_reg <fn>      
 4     4 <tibble [260 x 22]>   3   <tibble [260 x 21]> <chr [260]> boosted_log_reg <fn>      
 5     1 <tibble [260 x 22]>   0.5 <tibble [260 x 21]> <chr [260]> decision_tree   <fn>      
 6     2 <tibble [260 x 22]>   1   <tibble [260 x 21]> <chr [260]> decision_tree   <fn>      
 7     3 <tibble [260 x 22]>   2   <tibble [260 x 21]> <chr [260]> decision_tree   <fn>      
 8     4 <tibble [260 x 22]>   3   <tibble [260 x 21]> <chr [260]> decision_tree   <fn>      
 9     1 <tibble [260 x 22]>   0.5 <tibble [260 x 21]> <chr [260]> knn             <fn>      
10     2 <tibble [260 x 22]>   1   <tibble [260 x 21]> <chr [260]> knn             <fn>      
# ... with 14 more rows

All Together Now

Now that we've created every unique model + data-transformation combination, we can fit the models. To do so, we'll once again take advantage of purrr's map capabilities. In this case, we'd like to create a new list-column by calling each row's model function with that row's data, so we'll use invoke_map().
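
invoke_map() pairs a list of functions with a parallel list of argument lists, calling each function with its own arguments. Here's a minimal sketch on toy functions (note that newer purrr releases have since retired invoke_map() in favour of map2() plus exec(), though it still works):


# call each function with its matching argument list
funcs  <- list(mean, paste)
params <- list(
  list(x = c(1, 2, 9)),
  list('keep', 'it', 'together', sep = '-')
)
purrr::invoke_map(funcs, params)
# returns list(4, "keep-it-together")

For our dataframe, we first bundle each row's transformed data and label into a `params` list-column, then invoke the corresponding model function on it: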


# evaluate models
menu <- menu %>% 
  mutate(
    params = map2(trns_data, label, ~ list(data = .x, label = .y)),
    model_fit = invoke_map(model_func, params)
  ) 
menu

# A tibble: 24 x 9
   index data                power trns_data           label       model           model_func params     model_fit  
   <int> <list>              <dbl> <list>              <list>      <chr>           <list>     <list>     <list>     
 1     1 <tibble [260 x 22]>   0.5 <tibble [260 x 21]> <chr [260]> svm             <fn>       <list [2]> <S3: train>
 2     2 <tibble [260 x 22]>   1   <tibble [260 x 21]> <chr [260]> random_forest   <fn>       <list [2]> <S3: train>
 3     1 <tibble [260 x 22]>   0.5 <tibble [260 x 21]> <chr [260]> boosted_log_reg <fn>       <list [2]> <S3: train>
 4     1 <tibble [260 x 22]>   0.5 <tibble [260 x 21]> <chr [260]> random_forest   <fn>       <list [2]> <S3: train>
 5     4 <tibble [260 x 22]>   3   <tibble [260 x 21]> <chr [260]> boosted_log_reg <fn>       <list [2]> <S3: train>
 6     2 <tibble [260 x 22]>   1   <tibble [260 x 21]> <chr [260]> boosted_log_reg <fn>       <list [2]> <S3: train>
 7     3 <tibble [260 x 22]>   2   <tibble [260 x 21]> <chr [260]> boosted_log_reg <fn>       <list [2]> <S3: train>
 8     3 <tibble [260 x 22]>   2   <tibble [260 x 21]> <chr [260]> random_forest   <fn>       <list [2]> <S3: train>
 9     4 <tibble [260 x 22]>   3   <tibble [260 x 21]> <chr [260]> random_forest   <fn>       <list [2]> <S3: train>
10     2 <tibble [260 x 22]>   1   <tibble [260 x 21]> <chr [260]> svm             <fn>       <list [2]> <S3: train>
# ... with 14 more rows

Great, so now we have all of our data together and our models fit. We have everything we need to identify the best models; we just need to extract the results. Once again, we take advantage of caret's consistent API and map a simple function across our `model_fit` column.
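
Each fitted caret model stores its resampling results in a `results` data frame, with one row per candidate tuning value and columns such as Accuracy and AccuracySD, so taking the max over Accuracy gives the best score that model achieved. As a hedged peek at a single fitted model (whichever happens to sit in the first row):


# one row per tuning value tried, with resampled performance estimates
menu$model_fit[[1]]$results %>% 
  select(Accuracy, AccuracySD)

Extracting these values for every model + transformation pair and ranking the results: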


# extract results for each model
trained_models <- menu %>% 
  mutate(
    accuracy = map_dbl(model_fit, ~max(.x$results$Accuracy)),
    accuracySD = map_dbl(model_fit, ~max(.x$results$AccuracySD))
  ) %>% 
  arrange(desc(accuracy))

trained_models

# A tibble: 24 x 11
   index data              power trns_data          label       model         model_func params    model_fit  accuracy accuracySD
   <int> <list>            <dbl> <list>             <list>      <chr>         <list>     <list>    <list>        <dbl>      <dbl>
 1     1 <tibble [260 x 2~   0.5 <tibble [260 x 21~ <chr [260]> svm           <fn>       <list [2~ <S3: trai~    0.828     0.0545
 2     2 <tibble [260 x 2~   1   <tibble [260 x 21~ <chr [260]> random_forest <fn>       <list [2~ <S3: trai~    0.822     0.0506
 3     1 <tibble [260 x 2~   0.5 <tibble [260 x 21~ <chr [260]> boosted_log_~ <fn>       <list [2~ <S3: trai~    0.821     0.0544
 4     1 <tibble [260 x 2~   0.5 <tibble [260 x 21~ <chr [260]> random_forest <fn>       <list [2~ <S3: trai~    0.820     0.0376
 5     4 <tibble [260 x 2~   3   <tibble [260 x 21~ <chr [260]> boosted_log_~ <fn>       <list [2~ <S3: trai~    0.816     0.0534
 6     2 <tibble [260 x 2~   1   <tibble [260 x 21~ <chr [260]> boosted_log_~ <fn>       <list [2~ <S3: trai~    0.815     0.0437
 7     3 <tibble [260 x 2~   2   <tibble [260 x 21~ <chr [260]> boosted_log_~ <fn>       <list [2~ <S3: trai~    0.815     0.0490
 8     3 <tibble [260 x 2~   2   <tibble [260 x 21~ <chr [260]> random_forest <fn>       <list [2~ <S3: trai~    0.813     0.0486
 9     4 <tibble [260 x 2~   3   <tibble [260 x 21~ <chr [260]> random_forest <fn>       <list [2~ <S3: trai~    0.805     0.0393
10     2 <tibble [260 x 2~   1   <tibble [260 x 21~ <chr [260]> svm           <fn>       <list [2~ <S3: trai~    0.766     0.112 
# ... with 14 more rows

So our best model is the SVM fit to the square-root-transformed data (power = 0.5), with an estimated accuracy of roughly 0.83. An added benefit of our approach is that not only is everything organized (i.e. all related items stored together), it's organized in a tidy format. This allows us to do further analysis on our data using tidy functions. For example, we can easily pipe our results into ggplot to compare the models visually:


# plot model accuracies
trained_models %>% 
  ggplot(aes(x = as.factor(power), colour = model)) +
  geom_point(aes(y = accuracy), size = 2) +
  geom_errorbar(aes(ymin = accuracy - accuracySD,
                    ymax = accuracy + accuracySD)) +
  theme_classic() +
  facet_wrap(~model)
[Plot: model accuracy with ±SD error bars for each power transformation, faceted by model]


References & Related Work