**Introduction**

One of the quirks of programming in R is the deluge of packages, each with its own interface, although paradoxically that variety is also one of the language's primary benefits.

This is different from the Python experience, where most machine learning functionality is contained in scikit-learn. Although Python does require us to remember which components to import, at least everything is in the same place.

*Tidymodels* is a modern attempt to harmonize the many machine learning packages for R in the Tidyverse way. It is not the only solution; another popular modern option is *mlr3*, which is featured in prior articles on this site.

There is a more technical overview of the *tidymodels* packages available on the tidymodels website. This post is intended as a high-level overview from a modeling perspective. Although the framework is intended to harmonize machine learning, it accomplishes this with an array of new packages to learn, and the documentation is somewhat fragmented for that reason.

**Tidy Impressions**

*Tidymodels* has a very functional workflow, although in a strange way it actually seems more verbose than *mlr3*, which sets a high bar in that regard. It is somewhat reminiscent of the older *mlr* package, which featured excessively long function names, but in spite of that it has a logical feel.

If you hate R6 syntax and like to keep things simple, this is a great framework. If you don't mind R6, then you might find it takes an unnecessary amount of typing. The same goes for the masses of pipes your code will accumulate; if you like that structure you'll feel right at home.

All in all it's a very slick machine learning framework, and it is uniquely readable, which will be appreciated by anyone having to maintain your code, including your future self.

**Package Structure**

One of the first things to get acquainted with in *tidymodels* is the rather large number of core packages:

- *workflows*: This bundles the model and the preprocessing recipe into a single tunable pipeline.
- *parsnip*: This is the unified model definition, which goes into a workflow alongside the recipe (just like the real thing).
- *recipes*: This is the component to define preprocessing steps and formulas.
- *dials*: Allows specification of hyperparameter ranges and values for tuning algorithms.
- *tune*: This is what is used, perhaps unsurprisingly, to tune model hyperparameters.
- *yardstick*: This is where you define the metrics to optimize and evaluate your model on.
- *rsample*: This is what you will use to create data splits and conduct resampled hyperparameter tuning.

There are also some additional packages such as *multilevelmod* for hierarchical models, *finetune* for additional parameter-search algorithms, and *usemodels* for the lazy (or efficient, depending on your point of view) data scientist.
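Rather than attaching each of these individually, the whole core set can be loaded with one call (a minimal sketch, assuming the *tidymodels* meta-package is installed):

```r
## Attaching the meta-package loads the core packages listed above
## (parsnip, recipes, workflows, dials, tune, yardstick, rsample)
## along with a handful of tidyverse packages such as dplyr and ggplot2
library(tidymodels)
```

This is the conventional first line of a tidymodels script and is what the rest of this post assumes.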

**Tidymodels Process**

Below we divide the data science workflow into three phases:

- Model Workflow: Model selection and preprocessing steps.
- Tuning: Tune our hyper-parameters to get the best results.
- Evaluation: Measure our results.

**Model Workflow**

We are not going to get deep into analysis in this article, but we will use a Kaggle dataset on car prices.

Our first step is to create the ubiquitous validation set using an initial split, in this case holding out 20% of the data. These functions are contained in the *rsample* package, which is loaded as part of the core.

From this point forward we can reference our “CarTrain” set for training and use our “CarTest” set for tuning validation.

```
## 80/20 train/test split (the 20% test set is held out for validation)
CarSplit <- initial_split(CarMod, prop = 0.80)
CarTrain <- training(CarSplit)
CarTest <- testing(CarSplit)
```

Once we have created our testing set, we need to create a few objects:

- Parsnip: We tell R that we're doing linear regression using the standard "lm" engine.
- Recipe: We provide a formula, a few preprocessing steps, and an interaction term.
- Workflow: We define a workflow with the above components.

```
## Parsnip model: linear regression with the "lm" engine
LinReg <- linear_reg() %>%
  set_engine("lm")

## Create recipe: formula, preprocessing steps, and an interaction term
lm.mod <- recipe(price ~ citympg + highwaympg + horsepower + fueltype,
                 data = CarTrain) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_scale(all_predictors()) %>%
  step_interact(~ horsepower:citympg)
lm.mod

## Create workflow combining the model and the recipe
Lm.Flow <- workflow() %>%
  add_model(LinReg) %>%
  add_recipe(lm.mod)
Lm.Flow
```

Note that while we could fit our model without a workflow, the benefit of a workflow is that it allows us to effectively treat the model fitting and preprocessing stage as a single unit during tuning.

This will also be helpful when we have new data for predictions because the workflow will apply the required preprocessing and feed it into the fitted model without any extra effort.

Note that there are some very helpful functions for variable selection within the tidyverse pipe framework, such as *all_nominal_predictors()*, which collects all of the factor variables. The framework is smart enough to ignore the response variable in these calls, which makes them quite intuitive.
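As a small illustration of how these selectors compose (using the built-in mtcars data rather than the car-price set, so the column names here are only stand-ins):

```r
library(tidymodels)

## Stand-in data: treat cyl as a nominal predictor and mpg as the outcome
cars_df <- mtcars
cars_df$cyl <- factor(cars_df$cyl)

rec <- recipe(mpg ~ ., data = cars_df) %>%
  step_normalize(all_numeric_predictors()) %>%  # every numeric predictor
  step_dummy(all_nominal_predictors())          # every factor predictor

## prep() estimates the steps; mpg is untouched because it is the outcome
prepped <- prep(rec, training = cars_df)
baked   <- bake(prepped, new_data = NULL)
head(baked)
```

Each selector resolves against the roles declared in the formula, so adding or removing columns from the data does not require rewriting the steps.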

**Tuning Phase**

Resampling and tuning are fairly straightforward. We need to identify and flag hyperparameters for tuning with the *tune()* function. This is where *tune* and *rsample* come into the picture.

Since our model currently has nothing to tune we will first add a polynomial term by updating our model and updating the recipe contained in our workflow.

```
## Add a tunable polynomial term to the recipe
lm.mod <- lm.mod %>%
  step_poly(horsepower, degree = tune("degrees"))
lm.mod

## Push the updated recipe into the workflow
Lm.Flow <- Lm.Flow %>% update_recipe(lm.mod)
Lm.Flow
```

Next we will set up repeated 3-fold cross-validation and a tuning grid. The *tune_grid()* function requires a resampling strategy and either an integer or a data frame representing the tuning grid. Since we only have one simple parameter, we will ask *tune_grid()* to create 10 randomly chosen sets containing our single parameter.

Once this process has completed we can examine our metrics and choose the best tune based on our preferred metric, in this case RMSE.

```
## Repeated 3-fold cross-validation
lm.cv <- vfold_cv(CarTrain, v = 3, repeats = 5)
lm.cv

## Tune over 10 randomly generated parameter sets
Tune.res <- Lm.Flow %>%
  tune_grid(
    resamples = lm.cv,
    grid = 10
  )

Tune.res %>% collect_metrics()
Best.params <- Tune.res %>% select_best(metric = "rmse")
```

We can add the optimal parameters back in with one of the *finalize* family of functions. Since we are using a workflow, we need to call the *finalize_workflow()* function.

`Lm.Final <- finalize_workflow(Lm.Flow, Best.params)`

Once our hyperparameters are set, we can fit the workflow (including the model) on our training data.

```
### Fit Model
Lm.Fit <- fit(Lm.Final, CarTrain)
Lm.Fit
```

**Evaluation Phase**

With the fitted workflow, we can now run predictions on our validation set and see how our model performs on unseen data. This is where the *yardstick* package comes into play.

We specify some metrics in a metric set, and then calculate them after specifying the ground truth and estimate columns.

```
## Predict on the hold-out set and pair predictions with the truth
lm.res <- predict(Lm.Fit, new_data = CarTest)
final.res <- bind_cols(lm.res, CarTest %>% select(price))

## Define and compute the evaluation metrics
metrics <- metric_set(rmse, rsq, mae)
metrics(final.res, truth = price, estimate = .pred)
```

```
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       4000.
2 rsq     standard       0.746
3 mae     standard       3125.
```

**Conclusion**

Tidymodels provides a logical and readable machine learning framework for R that will make tidyverse fans feel right at home.

This post was intended to cover a basic end-to-end *tidymodels* workflow highlighting the steps and packages that come into play. We encourage the reader to explore the package documentation to gain a more in-depth understanding of the many features available.

**References**

Kuhn, Max; Silge, Julia. *Tidy Modeling with R*. O’Reilly, 2022.

Boehmke, Brad; Greenwell, Brandon. *Hands-On Machine Learning with R.* Chapman & Hall/CRC, 2019.

Kuhn, Max; Johnson, Kjell. *Feature Engineering and Selection.* Chapman & Hall/CRC, 2020.
