The Ordered Quantile (ORQ) normalization transformation, orderNorm(), is a rank-based procedure in which each value of a vector is mapped to its percentile, which is then mapped to the same percentile of the normal distribution. In the absence of ties, this essentially guarantees that the transformed values follow a normal distribution.

The transformation is:

$$g(x) = \Phi^{-1}\left(\frac{\operatorname{rank}(x) - 0.5}{\operatorname{length}(x)}\right)$$

where $$\Phi$$ refers to the standard normal cdf, rank(x) refers to each observation's rank, and length(x) refers to the number of observations.
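As a minimal sketch of this formula in base R (not the package's internal code; the seed and variable names are illustrative), the transformed values for a tie-free sample can be computed directly:

```r
set.seed(42)
x <- rgamma(100, 1, 1)   # continuous data, so ties are not expected

# Percentile of each observation: (rank - 0.5) / n lies strictly in (0, 1)
pct <- (rank(x) - 0.5) / length(x)

# Map each percentile to the same percentile of the standard normal
g_x <- qnorm(pct)

# The transformation is monotone, so the ordering of the data is preserved
identical(order(g_x), order(x))
```

Subtracting 0.5 from the ranks keeps every percentile away from 0 and 1, so qnorm() never returns an infinite value.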

By itself, this method is not new; the earliest mention of it that I could find is in a 1947 paper by Bartlett (see references). The formula was outlined explicitly in Van der Waerden and expounded upon in Beasley (2009). However, this version differs in one key respect, as explained below.

Using linear interpolation between these percentiles, the ORQ normalization becomes a 1-1 transformation that can be applied to new data. However, outside of the observed domain of x, it is unclear how to extrapolate the transformation. In the ORQ normalization procedure, a binomial glm with a logit link is fit to the ranks in order to extrapolate beyond the bounds of the original domain of x. The inverse normal CDF is then applied to these extrapolated predictions to obtain the extrapolated transformation. This mitigates the influence of heavy-tailed distributions while preserving the 1-1 nature of the transformation. (The extrapolation will provide a warning unless warn = FALSE.) However, we found that the extrapolation was able to perform very well even on data as heavy-tailed as a Cauchy distribution (paper to be published).
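The two regimes described above can be sketched in base R. This is an illustration under stated assumptions, not the package's implementation: approx() stands in for the interpolation step, the glm is fit to the percentiles of every observation, and all variable names are hypothetical.

```r
set.seed(1)
x <- rgamma(200, 1, 1)
pct <- (rank(x) - 0.5) / length(x)

# Inside the observed range: interpolate linearly between observed percentiles.
# With the default rule = 1, approx() returns NA outside [min(x), max(x)].
new_inside <- unname(quantile(x, c(0.1, 0.5, 0.9)))
g_inside <- qnorm(approx(x, pct, xout = new_inside)$y)

# Outside the observed range: fit a binomial glm with a logit link to the
# percentiles. Fractional responses make glm() warn about non-integer
# successes; that warning is suppressed here (an assumption of this sketch).
fit <- suppressWarnings(
  glm(pct ~ x, family = binomial(link = "logit"))
)
new_outside <- max(x) + c(1, 2)
pct_outside <- predict(fit, newdata = data.frame(x = new_outside),
                       type = "response")

# Apply the inverse normal CDF to the extrapolated percentiles
g_outside <- qnorm(pct_outside)
```

Because the logistic curve approaches 0 and 1 only asymptotically, the predicted percentiles stay inside the unit interval for moderately out-of-range values, and the extrapolated values remain finite and increasing in x.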

The fit used to perform the extrapolation uses a default of 10000 observations (or length(x) if that is less). This approximation improves scalability, both computationally and in terms of memory used. Do not set n_logit_fit too low (e.g. <100), as there is no benefit to doing so. Increase it if your test data set is large relative to 10000 and/or if you are worried about losing signal in the extremes of the range.

This transformation can be performed on new data and inverted via the predict function.

## Usage

orderNorm(x, n_logit_fit = min(length(x), 10000), ..., warn = TRUE)

# S3 method for orderNorm
predict(object, newdata = NULL, inverse = FALSE, warn = TRUE, ...)

# S3 method for orderNorm
print(x, ...)

## Arguments

x

A vector to normalize

n_logit_fit

Number of points used to fit logit approximation

...

additional arguments

warn

if TRUE, a warning is issued for ties or for transforms outside the observed range

object

an object of class 'orderNorm'

newdata

a vector of data to be (reverse) transformed

inverse

if TRUE, performs reverse transformation

## Value

A list of class orderNorm with elements

x.t

transformed original data

x

original data

n

number of nonmissing observations

ties_status

indicator if ties are present

fit

fit to be used for extrapolation, if needed

norm_stat

Pearson's P / degrees of freedom

The predict function returns the numeric value of the transformation performed on new data, and allows for the inverse transformation as well.

## See also

boxcox, lambert, bestNormalize, yeojohnson

## Examples


x <- rgamma(100, 1, 1)

orderNorm_obj <- orderNorm(x)
orderNorm_obj
#> orderNorm Transformation with 100 nonmissing obs and no ties
#>  - Original quantiles:
#>    0%   25%   50%   75%  100%
#> 0.001 0.313 0.808 1.424 5.744
p <- predict(orderNorm_obj)
x2 <- predict(orderNorm_obj, newdata = p, inverse = TRUE)

all.equal(x2, x)
#> [1] TRUE
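
The fitted object can also be applied to data it has not seen. The sketch below assumes the bestNormalize package is installed (the guard skips the code otherwise) and uses only the arguments listed under Arguments; the seed and the new values are illustrative.

```r
if (requireNamespace("bestNormalize", quietly = TRUE)) {
  library(bestNormalize)
  set.seed(100)
  x <- rgamma(100, 1, 1)

  # n_logit_fit sets how many points are used for the extrapolation fit;
  # for 100 observations this equals the default
  obj <- orderNorm(x, n_logit_fit = 100)

  # One value inside the observed range and one beyond its maximum;
  # warn = FALSE silences the out-of-range warning
  new_x <- c(median(x), max(x) + 1)
  new_t <- predict(obj, newdata = new_x, warn = FALSE)
  new_t
}
```

Leaving warn at its default of TRUE is the safer choice in interactive work, since the warning flags exactly the values that rely on extrapolation.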