Box-Cox explained

What is the goal of Box-Cox transformation?

Many statistical tests and intervals are based on the assumption of normality. Unfortunately, many real data sets are in fact not approximately normal. However, an appropriate transformation of a data set can often yield a data set that does follow approximately a normal distribution. The Box-Cox transformation is a particularly useful family of transformations. It is defined as \[ T(Y)=\frac{Y^{\lambda}-1}{\lambda}\text{ when $\lambda\neq 0$}\\ T(Y)=\log(Y)\text{ when $\lambda=0$} \] where \(Y\) is the response variable and \(\lambda\) is the transformation parameter.

\(Y\) is usually required to be positive. When \(Y>-c\) for some positive constant \(c\), the transformation can be applied to the shifted response: \[ T(Y)=\frac{(Y+c)^{\lambda}-1}{\lambda}\text{ when $\lambda\neq 0$}\\ T(Y)=\log(Y+c)\text{ when $\lambda=0$} \] The scaled form \((Y^{\lambda}-1)/\lambda\) is slightly preferable to the raw power \(Y^{\lambda}\) for theoretical analysis because it is continuous at \(\lambda=0\): its limit as \(\lambda\to 0\) is \(\log Y\).
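
The definition translates directly into code. Below is a minimal sketch in Python/NumPy (the function name, the shift argument `c`, and the sample data are illustrative, not from the original text); the last two lines show the continuity at \(\lambda=0\).

```python
import numpy as np

def boxcox_transform(y, lam, c=0.0):
    """Shifted Box-Cox transform: ((y + c)**lam - 1) / lam, or log(y + c) when lam == 0."""
    y = np.asarray(y, dtype=float) + c
    if np.any(y <= 0):
        raise ValueError("y + c must be strictly positive")
    if lam == 0:
        return np.log(y)
    return (y**lam - 1.0) / lam

# As lam -> 0 the transform approaches the log transform (continuity at lam = 0)
y = np.array([0.5, 1.0, 2.0, 4.0])
print(boxcox_transform(y, 1e-8))   # ~ [-0.693, 0.0, 0.693, 1.386]
print(np.log(y))
```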

Linear models may not meet the assumptions of Ordinary Least Squares (OLS): the residuals may not be normally distributed, or they may violate the homoscedasticity assumption that residuals have equal variance at every fitted value. The Box-Cox transformation is used to stabilize the error variance and make the residuals closer to normally distributed.

What is the goal of using MLE in Box-Cox transformation?

The goal of maximum likelihood estimation in the Box--Cox context is to find the transformation parameter \(\lambda\) that maximizes the likelihood under the assumption of normal residuals: \[ \hat{\lambda} = \arg\max_{\lambda} \log \mathcal{L}\!\left( \lambda \mid Y,\; X,\; \hat{\beta}(\lambda),\; \hat{\sigma}^2(\lambda) \right). \]
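
As an illustration (not part of the original text), SciPy's `scipy.stats.boxcox` chooses \(\lambda\) by exactly this kind of maximum-likelihood criterion, although for a constant-mean normal model rather than a regression; the simulated data below are an assumption for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Skewed, strictly positive data (log-normal) -- purely illustrative
y = rng.lognormal(mean=1.0, sigma=0.6, size=200)

# stats.boxcox returns the transformed data and the lambda that
# maximizes the profile log-likelihood for a constant-mean normal model
y_transformed, lam_hat = stats.boxcox(y)
print("lambda-hat:", lam_hat)   # close to 0 for log-normal data
```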

How is Box-Cox transformation derived?

Box and Cox (1964) developed a family of transformations designed to reduce non-normality of the errors in a linear model and to stabilize their variance (i.e., to reduce heteroscedasticity). The linear model before transformation is: \[ Y_i = \beta_0 + \beta X_i + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2). \]

The idea is to replace the response variable \(Y_i\) by a transformed variable \(Y_i^{(\lambda)}\), leaving the right-hand side of the regression model unchanged, so that the regression residuals become approximately normal with constant variance. The regression coefficients will change because the response variable has changed; therefore they must be interpreted on the transformed scale.

The standard Box--Cox transformation is: \[ Y_i^{(\lambda)} = \begin{cases} \dfrac{Y_i^{\lambda} - 1}{\lambda}, & \lambda \neq 0,\\[8pt] \log(Y_i), & \lambda = 0. \end{cases} \]

The inverse transformation is: \[ Y_i = \begin{cases} \left(1 + \lambda Y_i^{(\lambda)}\right)^{1/\lambda}, & \lambda \neq 0,\\[10pt] \exp\!\left(Y_i^{(\lambda)}\right), & \lambda = 0. \end{cases} \] Because we model the transformed variable \(Z = g(Y)\) as normal but observe \(Y\), the density of \(Y\) follows from the change-of-variable formula: \[ f_Y(y) = f_Z\!\left(g(y)\right) \left| \frac{d g(y)}{dy} \right|. \]
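
A quick numerical round-trip check of the forward and inverse transforms, using SciPy's `scipy.special.boxcox` and `scipy.special.inv_boxcox` (the sample values are illustrative):

```python
import numpy as np
from scipy.special import boxcox, inv_boxcox

y = np.array([0.5, 1.0, 2.0, 4.0])      # illustrative positive values
for lam in (0.0, 0.5, 2.0):
    z = boxcox(y, lam)                   # (y**lam - 1)/lam, log(y) when lam == 0
    y_back = inv_boxcox(z, lam)          # (1 + lam*z)**(1/lam), exp(z) when lam == 0
    assert np.allclose(y_back, y)
print("round-trip recovers the original data for each lambda")
```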

For the Box--Cox transformation: \[ \frac{dz}{dy} = y^{\lambda - 1}, \] and therefore the Jacobian determinant for \(n\) observations is: \[ J(\lambda) = \prod_{i=1}^n y_i^{\lambda - 1}. \] Assuming independent, normally distributed residuals after transformation, the full log-likelihood of the linear model becomes: \[ \log \mathcal{L} = -\frac{n}{2}\log(2\pi) - n\log\sigma -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \left[ Y_i^{(\lambda)} - (\beta_0 + \beta X_i) \right]^2 + (\lambda - 1)\sum_{i=1}^{n} \log Y_i. \]
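
For concreteness, this full log-likelihood can be evaluated directly for fixed parameter values; the sketch below does so in Python/NumPy (the data and parameter values are illustrative).

```python
import numpy as np

def boxcox_loglik(lam, y, x, beta0, beta1, sigma):
    """Full Box-Cox log-likelihood: normal model on the transformed
    scale plus the Jacobian term (lam - 1) * sum(log y)."""
    z = np.log(y) if lam == 0 else (y**lam - 1.0) / lam
    resid = z - (beta0 + beta1 * x)
    n = len(y)
    return (-0.5 * n * np.log(2 * np.pi)
            - n * np.log(sigma)
            - np.sum(resid**2) / (2 * sigma**2)
            + (lam - 1.0) * np.sum(np.log(y)))

# Illustrative values only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.7, 4.1, 6.0, 9.2, 13.5])
print(boxcox_loglik(0.0, y, x, beta0=0.6, beta1=0.4, sigma=0.1))
```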

When maximizing over \(\lambda\), the \(-\frac{n}{2}\log(2\pi)\) term is a constant and can be dropped; the terms involving \(\sigma\) and \(\beta\) are handled by substituting their maximum likelihood estimates for each fixed \(\lambda\), which yields the profile log-likelihood derived in the steps below.


Step 1. Replace \(\sigma^2\) by its MLE

For a fixed \(\lambda\), with \(\hat\beta_0\) and \(\hat\beta\) the OLS estimates on the transformed response, the MLE of \(\sigma^2\) (easily obtained by setting \(\frac{\partial \log \mathcal{L}}{\partial \sigma}=0\)) is

\[ \hat{\sigma}^2_\lambda = \frac{1}{n}\sum_{i=1}^n \left[Y_i^{(\lambda)} - (\hat{\beta}_0 + \hat{\beta} X_i)\right]^2. \]

Plugging this back into the likelihood and removing constants gives

\[ \mathcal{L}(\lambda) \propto -\frac{n}{2}\log\!\left( \sum_{i=1}^n \left[Y_i^{(\lambda)} - (\hat{\beta}_0 + \hat{\beta} X_i)\right]^2 \right) + (\lambda - 1)\sum_{i=1}^n \log Y_i. \]
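
A sketch of this profile log-likelihood in Python/NumPy (the simulated data are an assumption for the example): for each candidate \(\lambda\), an OLS fit supplies \(\hat\beta_0(\lambda)\), \(\hat\beta(\lambda)\) and the residual sum of squares.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(1, 10, size=100)
y = np.exp(0.3 + 0.2 * x + rng.normal(0, 0.1, size=100))    # illustrative positive response

def profile_loglik(lam, y, x):
    """Profile log-likelihood of lambda, up to additive constants."""
    z = np.log(y) if abs(lam) < 1e-12 else (y**lam - 1.0) / lam
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, z, rcond=None)             # beta0-hat, beta-hat for this lambda
    rss = np.sum((z - X @ beta) ** 2)
    n = len(y)
    return -0.5 * n * np.log(rss) + (lam - 1.0) * np.sum(np.log(y))

lams = np.linspace(-1.0, 1.0, 41)
vals = [profile_loglik(lam, y, x) for lam in lams]
print("lambda-hat ~", lams[int(np.argmax(vals))])            # near 0 for this simulated data
```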


Step 2. Express \(\sum \log Y_i\) using the geometric mean

Define the geometric mean of the responses as

\[ \dot{Y} = \exp\!\left(\frac{1}{n}\sum_{i=1}^n \log Y_i\right), \]

so that

\[ \sum_{i=1}^n \log Y_i = n \log \dot{Y}. \]

Then the second term becomes

\[ (\lambda - 1)\sum_{i=1}^n \log Y_i = n(\lambda - 1)\log \dot{Y}. \]
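
A quick numerical check of this identity (sample values illustrative):

```python
import numpy as np

y = np.array([0.5, 1.0, 2.0, 4.0])
gmean = np.exp(np.mean(np.log(y)))    # geometric mean Y-dot
lhs = np.sum(np.log(y))
rhs = len(y) * np.log(gmean)
print(np.isclose(lhs, rhs))           # True: sum(log Y_i) == n * log(Y-dot)
```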


Step 3. Combine terms inside the log

We can absorb this term into the residual sum of squares by dividing the sum of squares by \(\dot{Y}^{2(\lambda - 1)}\), giving:

\[ \mathcal{L}(\lambda) \propto -\frac{n}{2}\log\!\left[ \frac{1}{\dot{Y}^{2(\lambda - 1)}} \sum_{i=1}^n \left(Y_i^{(\lambda)} - (\hat{\beta}_0 + \hat{\beta} X_i)\right)^2 \right]. \]

After rearranging, this can also be written as:

\[ \mathcal{L}(\lambda) \propto -\frac{n}{2}\log\!\left[ \sum_{i=1}^n \left(\frac{Y_i^{(\lambda)} - (\hat{\beta}_0 + \hat{\beta} X_i)}{\dot{Y}^{\lambda-1}}\right)^2 \right]. \]
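
The following sketch (illustrative simulated data, NumPy assumed) verifies numerically that this scaled form agrees with the Step 1 expression for several values of \(\lambda\):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(1, 10, size=50)
y = rng.lognormal(0.2 + 0.1 * x, 0.1)          # illustrative positive response

def fit_rss(z, x):
    """Residual sum of squares from an OLS fit of z on an intercept and x."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, z, rcond=None)
    return np.sum((z - X @ beta) ** 2)

n = len(y)
gmean = np.exp(np.mean(np.log(y)))             # geometric mean Y-dot
for lam in (-0.5, 0.5, 1.5):
    z = (y**lam - 1.0) / lam
    step1 = -0.5 * n * np.log(fit_rss(z, x)) + (lam - 1.0) * np.sum(np.log(y))
    step3 = -0.5 * n * np.log(fit_rss(z / gmean**(lam - 1.0), x))
    assert np.isclose(step1, step3)
print("the two forms agree for each lambda")
```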


Step 4. Interpretation

  • \(\dot{Y}\) is the geometric mean of the responses.
  • The scaling by \(\dot{Y}^{\lambda-1}\) puts the transformed residuals back on the scale of the original response, so that residual sums of squares (and hence likelihood values) are comparable across different values of \(\lambda\).
  • The proportionality symbol \(\propto\) indicates that additive constants (such as the \(2\pi\) term) are omitted because they do not affect the optimal \(\lambda\).

Thus, the expression

\[ \mathcal{L}(\lambda) \propto -\frac{n}{2}\log\!\left[ \sum_{i=1}^n \left(\frac{Y_i^{(\lambda)} - (\hat{\beta}_0 + \hat{\beta} X_i)}{\dot{Y}^{\lambda-1}}\right)^2 \right] \]

is mathematically equivalent to the standard Box--Cox log-likelihood, up to constants.

How to obtain \(\beta_0\) and \(\beta\) for a given \(\lambda\)?

For a given \(\lambda\), compute the transformed responses \(Z_i(\lambda) \equiv Y_i^{(\lambda)}\) and apply OLS. With one regressor the estimators are \[ \hat\beta(\lambda)= \frac{\sum_{i=1}^n (X_i-\bar X)\,\big(Z_i(\lambda)-\bar Z(\lambda)\big)} {\sum_{i=1}^n (X_i-\bar X)^2}\\ \hat\beta_0(\lambda)=\bar Z(\lambda)-\hat\beta(\lambda)\,\bar X \] where \(\bar X=\frac1n\sum_i X_i\) and \(\bar Z(\lambda)=\frac1n\sum_i Z_i(\lambda)\).

Matrix form: \[ \hat{\boldsymbol\beta}(\lambda) = (X^\top X)^{-1} X^\top \mathbf{Z}(\lambda). \]
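
A small sketch (illustrative data, NumPy assumed) confirming that the single-regressor formulas and the matrix form give the same estimates for a given \(\lambda\):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(1, 10, size=60)
y = rng.lognormal(0.5 + 0.15 * x, 0.2)   # illustrative positive response

lam = 0.5
z = (y**lam - 1.0) / lam                 # Z_i(lambda)

# Closed-form OLS with a single regressor
beta1 = np.sum((x - x.mean()) * (z - z.mean())) / np.sum((x - x.mean())**2)
beta0 = z.mean() - beta1 * x.mean()

# Matrix form: (X^T X)^{-1} X^T Z
X = np.column_stack([np.ones_like(x), x])
beta_mat = np.linalg.solve(X.T @ X, X.T @ z)

print(np.allclose([beta0, beta1], beta_mat))   # True
```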

Is \(\sigma\) different for each given \(\lambda\)?

Yes. For each \(\lambda\) you fit a different regression model on \(Z_i(\lambda)\).

The MLE of \(\sigma^2\) is \[ \hat\sigma^2(\lambda) = \frac{1}{n} \sum_{i=1}^n \big(Z_i(\lambda)-\hat\beta_0(\lambda)-\hat\beta(\lambda)X_i\big)^2 = \frac{\mathrm{RSS}(\lambda)}{n}. \] Thus \(\hat\sigma^2\) changes with \(\lambda\).
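
To make this concrete, the sketch below (illustrative data) computes \(\hat\sigma^2(\lambda)=\mathrm{RSS}(\lambda)/n\) at a few values of \(\lambda\); the estimates differ because each \(\lambda\) defines a different transformed response.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(1, 10, size=80)
y = rng.lognormal(0.3 + 0.2 * x, 0.15)      # illustrative positive response
X = np.column_stack([np.ones_like(x), x])
n = len(y)

for lam in (0.0, 0.5, 1.0):
    z = np.log(y) if lam == 0 else (y**lam - 1.0) / lam
    beta, *_ = np.linalg.lstsq(X, z, rcond=None)
    sigma2 = np.sum((z - X @ beta) ** 2) / n    # MLE of sigma^2 for this lambda
    print(f"lambda = {lam:4.1f}  sigma2-hat = {sigma2:.4f}")
```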

Are \(Y_i\) iid normal? If not, why?

No.

Assumption: the errors on the transformed scale are iid normal:

\[ Z_i(\lambda)\mid X_i \sim \mathcal{N}(\beta_0+\beta X_i,\sigma^2). \]

Because the mean depends on \(X_i\), the responses are independent but not identically distributed.
On the original scale, the \(Y_i\) are generally not normal; the transformation is used to approximate normality and stabilize variance.
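
As an illustration (simulated data, not from the original text): when the errors are normal on the log scale, residuals from an OLS fit to \(\log Y\) pass a normality check that residuals from a fit to the raw \(Y\) tend to fail.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.uniform(1, 10, size=200)
y = np.exp(0.3 + 0.2 * x + rng.normal(0, 0.3, size=200))   # normal errors on the log scale
X = np.column_stack([np.ones_like(x), x])

for label, resp in [("raw Y", y), ("log Y (lambda = 0)", np.log(y))]:
    beta, *_ = np.linalg.lstsq(X, resp, rcond=None)
    resid = resp - X @ beta
    stat, p = stats.shapiro(resid)                         # Shapiro-Wilk normality test
    print(f"{label:20s} Shapiro-Wilk p-value = {p:.3g}")
```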

Reference

Box, G. E. P., & Cox, D. R. (1964). An analysis of transformations. Journal of the Royal Statistical Society: Series B (Methodological), 26(2), 211–252.