Introduction

One of the quirks of programming in R is the deluge of packages with different interfaces, although, paradoxically, that is also one of its primary benefits.

This is different from the Python experience, where most of machine learning is contained in scikit-learn. Although Python does require us to remember which components to import, at least everything is in the same place.

Tidymodels is a modern attempt to harmonize the many machine learning packages for R, the tidyverse way. It is not the only solution: another popular modern option is mlr3, which is featured in prior articles here and also here.

There is a more technical overview of the tidymodels packages available at the tidymodels website. This post is intended as a high-level overview from a modeling perspective. Although the framework is intended to harmonize machine learning, it does so with an array of new packages to learn, and the documentation is somewhat fragmented for that reason.

Tidy Impressions

Tidymodels has a very functional workflow; in a strange way it actually seems more verbose than mlr3, which sets a high bar in that regard. It is somewhat reminiscent of mlr2, which featured excessively long function names, but in spite of that it has a logical feel.

If you hate the R6 syntax and like to keep things simple, this is a great framework. If you don’t mind R6, then you might find it takes an unnecessary amount of typing. The same goes for the masses of pipes your code will become; if you like that structure, you’ll feel right at home.

All in all, it’s a very slick machine learning framework, and it is uniquely readable, which will be appreciated by anyone who has to maintain your code, including your future self.

Package Structure

One of the first things to get acquainted with in tidymodels is the rather large number of core packages:

  • Workflows: This is the tunable pipeline that bundles a model together with a recipe and its preprocessing.
  • Parsnip: This is the model definition that goes into a workflow, alongside a recipe (just like the real thing).
  • Recipes: This is the component used to define formulas and preprocessing steps.
  • Dials: Allows specification of hyper-parameter functions for tuning algorithms.
  • Tune: This is what is used, perhaps unsurprisingly, to tune model hyper-parameters.
  • Yardstick: This is where you define the metrics to optimize and evaluate your model with.
  • Rsample: This is what you will use to conduct resampled hyper-parameter tuning.
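All of these core packages can be attached at once through the tidymodels meta-package, which is how most scripts begin:

```r
## Attaching the meta-package loads the core packages listed above
## (parsnip, recipes, workflows, dials, tune, yardstick, rsample, ...)
library(tidymodels)

## Individual packages can also be attached on their own:
library(parsnip)
library(recipes)
```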

There are also some additional packages such as multilevelmod for hierarchical models, finetune for additional parameter-search algorithms, and usemodels for the lazy (or efficient, depending on your point of view) data scientist.
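As a taste of usemodels, a single call prints ready-made recipe, workflow, and tuning scaffolding to the console. This is only a sketch: the use_glmnet() call is one of several use_*() helpers, and the formula and variables here are illustrative, borrowed from the car-price data used later in this post.

```r
library(usemodels)

## Prints recipe/workflow/tuning scaffolding for a glmnet model;
## paste the printed code into a script and edit as needed
use_glmnet(price ~ horsepower + citympg, data = CarMod)
```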

Tidymodels Process

Below we divide the data science workflow into three phases:

  • Model Workflow: Model selection and preprocessing steps.
  • Tuning: Tune our hyper-parameters to get the best results.
  • Evaluation: Measure our results.

Model Workflow

We are not going to get deep into analysis in this article, but we will use a Kaggle dataset on car prices described here.

Our first step is to create the ubiquitous validation set using an initial split, in this case holding out 20%. These functions are contained in the rsample package, which is loaded as part of the core.

From this point forward we can reference our “CarTrain” set for training and use our “CarTest” set for tuning validation.

CarSplit <- initial_split(CarMod, prop = 0.80)
CarTrain <- training(CarSplit)
CarTest <- testing(CarSplit)

Once we have created our testing set, we need to create a few objects:

  1. Parsnip: We tell R that we’re doing linear regression using the standard “lm” engine.
  2. Recipe: We provide a formula, a few preprocessing steps, and an interaction term.
  3. Workflow: We define a workflow combining the above components.
## Parsnip Model
LinReg <- linear_reg() %>% set_engine("lm")

## Create Recipe
lm.mod <- recipe(price ~ citympg + highwaympg + horsepower + fueltype, data = CarTrain) %>%
  step_dummy(all_nominal_predictors()) %>% 
  step_scale(all_predictors()) %>% 
  step_interact(~horsepower:citympg)
lm.mod

## Create Workflow
Lm.Flow <- workflow() %>% 
  add_model(LinReg) %>%
  add_recipe(lm.mod)
Lm.Flow

Note that while we could fit our model without a workflow, the benefit of a workflow is that it lets us treat the model fitting and preprocessing stages as a single unit during tuning.

This will also be helpful when we have new data for predictions because the workflow will apply the required preprocessing and feed it into the fitted model without any extra effort.

Note that there are some very helpful variable-selection functions within the tidymodels pipe framework, such as all_nominal_predictors(), which selects all of the factor variables. The framework is smart enough to ignore the response variable in these calls, which makes them quite intuitive.
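To check what those selectors actually pick up, the recipe defined above can be estimated and its output inspected; a quick sketch using the recipe and training split from the workflow section:

```r
## prep() estimates each step on the training data;
## bake(new_data = NULL) returns the processed training set
lm.mod %>%
  prep(training = CarTrain) %>%
  bake(new_data = NULL) %>%
  head()
```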

Tuning Phase

Resampling and tuning are fairly straightforward. We need to identify and flag hyper-parameters for tuning with the tune() function. This is where tune and rsample come into the picture.

Since our model currently has nothing to tune, we will first add a polynomial term by updating our recipe and then updating the recipe contained in our workflow.

## Add New Parameter
lm.mod <- lm.mod %>% step_poly(horsepower, degree = tune("degrees"))
lm.mod
Lm.Flow <- Lm.Flow %>% update_recipe(lm.mod)
Lm.Flow

Next we will set up repeated 3-fold cross-validation and a tuning grid. The tune_grid() function requires a resampling strategy and either an integer or a data frame representing the tuning grid. Since we only have one simple parameter, we will ask tune_grid() to create 10 randomly chosen sets containing our single parameter.

Once this process has been completed, we can examine our metrics and choose the best tune based on our preferred metric, in this case RMSE.

## Cross-Validation
lm.cv <- vfold_cv(CarTrain, v = 3, repeats = 5)
lm.cv

Tune.res <- 
  Lm.Flow %>% 
  tune_grid(
    resamples = lm.cv,
    grid = 10
  )

Tune.res %>% collect_metrics()
Best.params <- Tune.res %>% select_best(metric = "rmse")
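Rather than letting tune_grid() pick values at random, an explicit grid can be supplied as a data frame whose column name matches the tune() id, here “degrees”. A sketch reusing the workflow and folds above, with an illustrative range of 1 to 5:

```r
## Explicit grid: one column per tuned parameter, named by its tune() id
Deg.Grid <- tibble(degrees = 1:5)

Tune.res <- Lm.Flow %>%
  tune_grid(resamples = lm.cv, grid = Deg.Grid)
```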

We can add the optimal parameters back in with one of the finalize family of functions. Since we are using a workflow, we need to call finalize_workflow().

Lm.Final <- finalize_workflow(Lm.Flow, Best.params)

Once our hyper-parameters are set, we can then fit the workflow (including the model) on our training data.

## Fit Model
Lm.Fit <- fit(Lm.Final, CarTrain)
Lm.Fit
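If you want the familiar lm coefficient table, the underlying parsnip fit can be pulled out of the fitted workflow and passed to broom’s tidy(); a quick sketch:

```r
## extract_fit_parsnip() returns the parsnip model inside the workflow;
## tidy() turns the underlying lm fit into a tibble of coefficients
Lm.Fit %>%
  extract_fit_parsnip() %>%
  tidy()
```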

Evaluation Phase

With the fitted workflow, we can now run predictions on our validation set and see how our model performs on unseen data. This is where the yardstick package comes into play.

We specify some metrics in a metric set, and then calculate them after specifying the ground truth and estimate columns.

lm.res <- predict(Lm.Fit, new_data = CarTest)
final.res <- bind_cols(lm.res, CarTest %>% select(price))

metrics <- metric_set(rmse, rsq, mae)
metrics(final.res, truth = price, estimate = .pred)
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard    4000.   
2 rsq     standard       0.746
3 mae     standard    3125. 

Conclusion

Tidymodels provides a logical and readable machine learning framework for R that will make tidyverse fans feel right at home.

This post was intended to cover a basic end-to-end tidymodels workflow highlighting the steps and packages that come into play. We encourage the reader to explore the package documentation to gain a more in-depth understanding of the many features available.

References

Kuhn, Max; Silge, Julia. Tidy Modeling with R. O’Reilly, 2022.

Boehmke, Brad; Greenwell, Brandon. Hands-On Machine Learning with R. Chapman & Hall/CRC, 2019.

Kuhn, Max; Johnson, Kjell. Feature Engineering and Selection. Chapman & Hall/CRC, 2020.