Preprocess & create a model matrix with interactions + polynomials

Usage

sparseR_prep(
  formula,
  data,
  k = 1,
  poly = 1,
  pre_proc_opts = c("knnImpute", "scale", "center", "otherbin", "none"),
  ia_formula = NULL,
  filter = c("nzv", "zv"),
  extra_opts = list(),
  family = "gaussian"
)

Arguments

formula

A formula of the main effects + outcome of the model

data

A required data frame or tibble containing the variables in formula

k

Maximum order of interactions to numeric variables

poly

The maximum order of polynomials to consider

pre_proc_opts

A character vector specifying methods for preprocessing (see details)

ia_formula

Formula to be passed to step_interact (for interactions, see details)

filter

Which methods should be used to filter out variables with (near) zero variance? (see details)

extra_opts

Extra options to be used for preprocessing

family

Family passed from sparseR

Value

An object of class recipe; see recipes::recipe()

Details

The pre_proc_opts argument acts as a wrapper for the corresponding procedures in the recipes package. The currently supported options that can be passed to pre_proc_opts are:

knnImpute: Should k-nearest-neighbors imputation be performed (if necessary)?

scale: Should variables be scaled prior to creating interactions? (Factor variables and dummy variables are not scaled.)

center: Should variables be centered? (Factor variables and dummy variables are not centered.)

otherbin:
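As a rough illustration of choosing preprocessing options, the sketch below requests only centering and scaling. It assumes the sparseR package is installed and uses the built-in iris data as a stand-in; the object name rec_cs is hypothetical, and the exact steps that end up in the recipe may differ between package versions.

```r
library(sparseR)

# Illustrative call on built-in data: only center and scale the numeric
# predictors, skipping the other preprocessing options.
rec_cs <- sparseR_prep(
  Sepal.Width ~ .,
  data = iris,
  k = 1,
  poly = 1,
  pre_proc_opts = c("center", "scale")
)

# The Value section states that the result is a recipe object
inherits(rec_cs, "recipe")
```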

By default (ia_formula = NULL), all variables are interacted with each other up to order k. If specified, ia_formula is passed as the terms argument to recipes::step_interact; see the help documentation for that function for guidance on specifying particular interactions.
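A minimal sketch of supplying a custom ia_formula follows. It assumes sparseR is installed, uses iris as example data, and picks a single interaction between two numeric columns purely for illustration; the object name rec_ia is hypothetical. The one-sided formula follows the form that recipes::step_interact accepts for its terms argument.

```r
library(sparseR)

# Request one specific interaction instead of all pairwise interactions.
# The choice of Sepal.Length and Petal.Length is arbitrary.
rec_ia <- sparseR_prep(
  Sepal.Width ~ .,
  data = iris,
  k = 1,
  ia_formula = ~ Sepal.Length:Petal.Length
)

# List the variables the recipe knows about
summary(rec_ia)$variable
```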

The methods specified in filter are important; filtering is necessary to cut down on extraneous polynomials and interactions in cases where they do not make sense. This is the case, for instance, for polynomials of dummy variables, or for interactions between dummy variables that relate to the same categorical variable.
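The two cases named above can be checked by hand in base R. The sketch below builds dummy variables manually from iris$Species (it does not call sparseR, and the variable names are made up for illustration) to show why such terms carry no usable information.

```r
# Two dummy variables derived from the same categorical variable
species <- iris$Species
d_versicolor <- as.numeric(species == "versicolor")
d_virginica  <- as.numeric(species == "virginica")

# Interaction of dummies from the same factor: no observation can be in
# both levels at once, so the product is zero everywhere.
same_factor_ia <- d_versicolor * d_virginica
var(same_factor_ia) == 0

# Polynomial of a dummy: squaring a 0/1 variable reproduces the variable,
# so the squared term is redundant.
squared_dummy <- d_versicolor^2
all(squared_dummy == d_versicolor)
```

Both comparisons evaluate to TRUE, which is the kind of degenerate column a zero-variance filter is meant to catch.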