`step_best_normalize` creates a specification of a recipe step (see `recipes` package) that will transform data using the best of a suite of normalization transformations estimated (by default) using cross-validation.

step_best_normalize(
  recipe,
  ...,
  role = NA,
  trained = FALSE,
  transform_info = NULL,
  transform_options = list(),
  num_unique = 5,
  skip = FALSE,
  id = rand_id("best_normalize")
)

# S3 method for step_best_normalize
tidy(x, ...)

# S3 method for step_best_normalize
axe_env(x, ...)

Arguments

recipe

A formula or recipe

...

One or more selector functions to choose which variables are affected by the step. See [selections()] for more details. For the `tidy` method, these are not currently used.

role

Not used by this step since no new variables are created.

trained

For recipes functionality

transform_info

A numeric vector of transformation values. This (was transform_info) is `NULL` until computed by [prep.recipe()].

transform_options

options to be passed to bestNormalize

num_unique

An integer where data that have less possible values will not be evaluate for a transformation.

skip

For recipes functionality

id

For recipes functionality

x

A `step_best_normalize` object.

Value

An updated version of `recipe` with the new step added to the sequence of existing steps (if any). For the `tidy` method, a tibble with columns `terms` (the selectors or variables selected) and `value` (the lambda estimate).

Details

The bestnormalize transformation can be used to rescale a variable to be more similar to a normal distribution. See `?bestNormalize` for more information; `step_best_normalize` is the implementation of `bestNormalize` in the `recipes` context.

As of version 1.7, the `butcher` package can be used to (hopefully) improve scalability of this function on bigger data sets.

See also

bestNormalize orderNorm, [recipe()] [prep.recipe()] [bake.recipe()]

Examples


library(recipes)
#> Loading required package: dplyr
#> 
#> Attaching package: ‘dplyr’
#> The following objects are masked from ‘package:stats’:
#> 
#>     filter, lag
#> The following objects are masked from ‘package:base’:
#> 
#>     intersect, setdiff, setequal, union
#> 
#> Attaching package: ‘recipes’
#> The following object is masked from ‘package:stats’:
#> 
#>     step
rec <- recipe(~ ., data = as.data.frame(iris))

bn_trans <- step_best_normalize(rec, all_numeric())

bn_estimates <- prep(bn_trans, training = as.data.frame(iris))

bn_data <- bake(bn_estimates, as.data.frame(iris))

plot(density(iris[, "Petal.Length"]), main = "before")

plot(density(bn_data$Petal.Length), main = "after")


tidy(bn_trans, number = 1)
#> # A tibble: 1 × 4
#>   terms         chosen cv_info id                  
#>   <chr>          <dbl>   <dbl> <chr>               
#> 1 all_numeric()     NA      NA best_normalize_Wm728
tidy(bn_estimates, number = 1)
#> # A tibble: 4 × 4
#>   terms        chosen_transform cv_info          id                  
#>   <chr>        <chr>            <named list>     <chr>               
#> 1 Sepal.Length sqrt_x           <tibble [9 × 4]> best_normalize_Wm728
#> 2 Sepal.Width  center_scale     <tibble [9 × 4]> best_normalize_Wm728
#> 3 Petal.Length orderNorm        <tibble [9 × 4]> best_normalize_Wm728
#> 4 Petal.Width  orderNorm        <tibble [9 × 4]> best_normalize_Wm728