Introduction
Transformations are changes to our response or feature variables that can be employed to solve issues such as a lack of homogeneous variance (heteroscedasticity), or to make interpretation easier, as in the case of the log-log price elasticity model.
MLR3 Pipelines provides an automated way to handle transformations and, where required, to transform predictions back to the original scale.
Let’s explore a few ways to use transformations with the MLR3 package and mlr3pipelines in particular.
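As a quick illustration of the interpretation point, here is a minimal base R sketch. It assumes a hypothetical data frame sales with price and quantity columns; in a log-log model the fitted slope can be read directly as the price elasticity.
## Hypothetical data: `sales` has numeric columns price and quantity
elasticity_fit <- lm(log(quantity) ~ log(price), data = sales)
## The slope is the elasticity: approximate % change in quantity per 1% change in price
coef(elasticity_fit)["log(price)"]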
Dataset
We will use a freely available Kaggle dataset that lists car prices based on an assortment of features such as length, horsepower, and city mileage.
Preparing Data
First we will do some preparation on our dataset and create a regression task.
options(mc.cores = 4)
library(tidyverse)
library(mlr3verse)
library(mlr3pipelines)

## Read the data and drop columns we will not model on
CarPrices <- read.csv("CarSales.csv", stringsAsFactors = TRUE)
CarPrices$curbweight <- NULL
CarPrices$symboling <- NULL
CarPrices$CarName <- NULL
CarPrices$car_ID <- NULL
## Convert the cylinder count from words to numbers
## (any value with no match becomes NA)
CarMod <- mutate(
  CarPrices,
  cylindernumber = case_when(
    cylindernumber == "one" ~ 1,
    cylindernumber == "two" ~ 2,
    cylindernumber == "three" ~ 3,
    cylindernumber == "four" ~ 4,
    cylindernumber == "five" ~ 5,
    cylindernumber == "six" ~ 6,
    cylindernumber == "seven" ~ 7,
    cylindernumber == "eight" ~ 8,
    cylindernumber == "twelve" ~ 12
  )
)

## Create a regression task with price as the target
PriceTask <- TaskRegr$new(
  id = "PriceTask",
  backend = CarMod,
  target = "price"
)
Create Learner
For this example we will create a single regression learner using glmnet.
GLearner = lrn("regr.glmnet",
id = "GlmModel",
s = to_tune(p_int(0, 15)),
alpha = to_tune(p_dbl(0, 1))
)
Independent Variables
The engine size variable is not normally distributed, so to demonstrate the mutate functionality we will apply a log transformation. In this case it seems to help somewhat, but there is still a lot of density near 4.5 on the log scale.
Note that there are specific pipeline operators for common transformations such as Box-Cox or Yeo-Johnson; using these operators may be preferable where we don’t need the flexibility of a customized mutation function (see the sketch after the histograms below).
par(mfrow=c(1,2))
hist(CarMod$enginesize, breaks = 20, main = "Engine Size: No Transform",
xlab = "Engine Size", col="beige")
hist(log(CarMod$enginesize), breaks = 20, main = "Engine Size: Log Transform",
xlab = "log(Enginesize)", col = "pink")
Transformation Pipeline
We’ll create a pipeline segment that performs feature encoding of the categorical features, then a log transformation on engine size, and finally a log transformation on the target.
The final step uses a pre-built pipeline available in mlr3pipelines called pipeline_targettrafo, which sets up the target transformation. The trafo step mutates the target and passes the transformed target to a learner along with an inversion function; the inverter step then calls that function to carry out the back transformation on the predictions.
This can be created from scratch with the target mutate and target invert pipe operators (see the sketch after the code block below), but using the existing pipeline for a standard procedure saves time and avoids having to worry about which inputs are going into which output channels of the component operators.
The learner we are specifying below is simply a clone of the above glmnet model, including the same tuning tokens.
Note that while the manual suggests specifying the inverter function as a named list (the form we use below), in our testing simply returning exp(x) from the inverter function also mapped correctly to the response variable, as it does in the trafo function.
par(mfrow = c(1,1))

## Feature encoder: one-hot encodes the categorical features
Encoder <- po("encode")

## Mutation on independent vars: adds the feature PM = log(enginesize)
mutate.pipe <- po("mutate")
mutations = list(
  PM = ~log(enginesize)
)
mutate.pipe$param_set$values$mutation = mutations

## Target TRAFO: log-transform the target and register the inverse
tt <- pipeline_targettrafo(PipeOpLearner$new(GLearner$clone()))
tt$param_set$values$targetmutate.trafo = function(x) log(x)
tt$param_set$values$targetmutate.inverter = function(x) list(response = exp(x$response))
plot(tt)
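For reference, here is a rough sketch of the from-scratch construction mentioned above. It is illustrative only, and assumes the default pipe op ids "targetmutate" and "targetinvert" plus our learner id "GlmModel" when wiring the edges:
## Manual equivalent of pipeline_targettrafo (sketch)
tm <- po("targetmutate")
tm$param_set$values$trafo <- function(x) log(x)
tm$param_set$values$inverter <- function(x) list(response = exp(x$response))
manual <- Graph$new()
manual$add_pipeop(tm)
manual$add_pipeop(po("learner", GLearner$clone()))  ## id "GlmModel"
manual$add_pipeop(po("targetinvert"))
## The inversion function travels on the "fun" channel, the transformed
## task goes to the learner, and the learner's prediction is inverted
manual$add_edge("targetmutate", "targetinvert", src_channel = "fun", dst_channel = "fun")
manual$add_edge("targetmutate", "GlmModel", src_channel = "output")
manual$add_edge("GlmModel", "targetinvert", dst_channel = "prediction")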
Next we will build out the remainder of the transformation pipeline: first we convert this section into an integrated learner with as_learner(), then we chain the pieces together with the graph concatenation operator %>>% and plot the result. Finally we wrap the complete graph as a single learner, FinalLearner, which is what we will tune and train below.
TrafoLearn <- as_learner(tt)
TrafoLearn$id <- "TRAFO"
graph <- Encoder %>>% mutate.pipe %>>% TrafoLearn
graph$plot()

## Wrap the full graph so it can be tuned and trained as one learner
FinalLearner <- as_learner(graph)
Tuning the Pipeline
Once our mutations are set up we can tune our hyperparameters as usual. A more detailed explanation is available in another article, Benchmarking Vehicle Prices with MLR3. Essentially, we specify a tuning strategy: in this case Bayesian optimization, run for 120 seconds with 3-fold cross-validation.
The entire pipeline we created will be tuned as a unified learner when hyperparameters are selected. In addition, when we have new data for predictions, the same transformations will automatically be applied to that data, as the sketch below illustrates.
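For instance, after training (we build FinalTuned below), scoring a hypothetical data frame new_cars of unseen cars needs no manual preprocessing; predict_newdata() runs the encoder, the mutate step, and the target back-transform automatically:
## new_cars is a hypothetical data frame with the same feature columns
preds <- FinalTuned$predict_newdata(new_cars)
preds$response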
### Sampling Loop
resampling_outer = rsmp("cv", folds = 3)
terminator = trm("run_time", secs = 120)
tuner = tnr("mbo")
measure = msr("regr.rmse")

## Hold out 10% of the rows for final validation
Splits = partition(PriceTask, ratio = 0.9, stratify = TRUE)

## Parallelize tuning over 4 background R sessions
future::plan("multisession", workers = 4)

set.seed(123)
instance = tune(
  tuner = tuner,
  task = PriceTask$clone()$filter(rows = Splits$train),  ## clone so the test rows stay available
  learner = FinalLearner,
  resampling = resampling_outer,
  measures = measure,
  terminator = terminator
)
instance$result_learner_param_vals
instance$result
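As an aside, the same configuration can be bundled into a single self-tuning learner with mlr3tuning’s auto_tuner(); a sketch under the same settings:
## AutoTuner: tunes on an internal resampling each time it is trained
at <- auto_tuner(
  tuner = tnr("mbo"),
  learner = FinalLearner,
  resampling = rsmp("holdout"),
  measure = msr("regr.rmse"),
  terminator = trm("run_time", secs = 120)
)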
Training and Prediction
Once we have our tuning parameters we can train our model and see how well it does on the validation data.
## Apply the tuned hyperparameters to a fresh copy of the pipeline
FinalTuned = FinalLearner$clone()
FinalTuned$param_set$values = instance$result_learner_param_vals
FinalTuned$param_set$values

## Train on the training rows, then predict the held-out rows
FinalTuned$train(PriceTask, row_ids = Splits$train)$model
ResultsEL <- FinalTuned$predict(PriceTask, row_ids = Splits$test)

## RMSE on the original price scale
sqrt(mean((ResultsEL$truth - ResultsEL$response)^2))
ResultsEL$response
> sqrt(mean((ResultsEL$truth - ResultsEL$response)^2))
[1] 1953.374
> ResultsEL$response
[1] 10655.279 18972.114 6943.257 7702.912 9760.490 8187.215 9958.276 12557.192 10388.676 6906.860 7204.251 8714.980 9168.476 13263.504
[15] 10928.470 15843.457 15859.587 20448.614 19376.529 31221.496
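The same figure is available directly from the prediction object via mlr3’s measure interface:
## Equivalent RMSE via the measure API
ResultsEL$score(msr("regr.rmse"))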
Our RMSE is 1,953 and our predictions have been back-transformed to the original price scale as intended.
Conclusion
MLR3 is a flexible machine learning package, and it provides plenty of flexibility for transformations on both the target and the feature variables. Before setting up your own transformations, be sure to check whether a pre-built pipe operator such as PCA, scale, Box-Cox, or Yeo-Johnson already covers your case.
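For example, chaining two of those pre-built operators in front of a learner takes one line; "scale" and "pca" are standard mlr3pipelines dictionary keys:
## Scale the numeric features, then project onto principal components
graph_alt <- po("scale") %>>% po("pca") %>>% lrn("regr.glmnet")
scaled_learner <- as_learner(graph_alt)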
References
Bischl, B., Sonabend, R., et al. (2024). Applied Machine Learning Using mlr3 in R. CRC Press / Chapman & Hall.