Box-Cox explained

What is the goal of Box-Cox transformation?

Many statistical tests and intervals are based on the assumption of normality. Unfortunately, many real data sets are in fact not approximately normal. However, an appropriate transformation of a data set can often yield a data set that does follow approximately a normal distribution. The Box-Cox transformation is a particularly useful family of transformations. It is defined as \[ T(Y)=\frac{Y^{\lambda}-1}{\lambda}\text{ when $\lambda\neq 0$}\\ T(Y)=\log(Y)\text{ when $\lambda=0$} \] where \(Y\) is the response variable and \(\lambda\) is the transformation parameter.

\(Y\) is usually required to be positive. When \(Y>-c\) for some positive constant \(c\), the transformation can be applied to the shifted response: \[ T(Y)=\frac{(Y+c)^{\lambda}-1}{\lambda}\text{ when $\lambda\neq 0$}\\ T(Y)=\log(Y+c)\text{ when $\lambda=0$} \] The scaled form \((Y^{\lambda}-1)/\lambda\) is slightly preferable to the raw power \(Y^{\lambda}\) for theoretical analysis because it is continuous at \(\lambda=0\): its limit as \(\lambda\to 0\) is \(\log Y\).
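
The definition translates directly into code. Below is a minimal sketch in Python/NumPy (the function name, the shift argument `c`, and the sample data are illustrative, not from the original text); the last two lines show the continuity at \(\lambda=0\).

```python
import numpy as np

def boxcox_transform(y, lam, c=0.0):
    """Shifted Box-Cox transform: ((y + c)**lam - 1) / lam, or log(y + c) when lam == 0."""
    y = np.asarray(y, dtype=float) + c
    if np.any(y <= 0):
        raise ValueError("y + c must be strictly positive")
    if lam == 0:
        return np.log(y)
    return (y**lam - 1.0) / lam

# As lam -> 0 the transform approaches the log transform (continuity at lam = 0)
y = np.array([0.5, 1.0, 2.0, 4.0])
print(boxcox_transform(y, 1e-8))   # ~ [-0.693, 0.0, 0.693, 1.386]
print(np.log(y))
```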

Linear models may not meet the assumptions of Ordinary Least Squares (OLS): the residuals may not be normally distributed, or they may violate the homoscedasticity assumption that residuals have equal variance at every fitted value. The Box-Cox transformation is used to stabilize the error variance and make the residuals closer to normally distributed.

What is the goal of using MLE in Box-Cox transformation?

The goal of maximum likelihood estimation in the Box--Cox context is to find the transformation parameter \(\lambda\) that maximizes the likelihood under the assumption of normal residuals: \[ \hat{\lambda} = \arg\max_{\lambda} \log \mathcal{L}\!\left( \lambda \mid Y,\; X,\; \hat{\beta}(\lambda),\; \hat{\sigma}^2(\lambda) \right). \]
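
As an illustration (not part of the original text), SciPy's `scipy.stats.boxcox` chooses \(\lambda\) by exactly this kind of maximum-likelihood criterion, although for a constant-mean normal model rather than a regression; the simulated data below are an assumption for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Skewed, strictly positive data (log-normal) -- purely illustrative
y = rng.lognormal(mean=1.0, sigma=0.6, size=200)

# stats.boxcox returns the transformed data and the lambda that
# maximizes the profile log-likelihood for a constant-mean normal model
y_transformed, lam_hat = stats.boxcox(y)
print("lambda-hat:", lam_hat)   # close to 0 for log-normal data
```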

How is Box-Cox transformation derived?

Box and Cox (1964) developed a family of transformations designed to reduce non-normality of the errors in a linear model and to stabilize their variance (i.e., to reduce heteroscedasticity). The linear model before transformation is: \[ Y_i = \beta_0 + \beta X_i + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2). \]

The idea is to replace the response variable \(Y_i\) by a transformed variable \(Y_i^{(\lambda)}\), leaving the right-hand side of the regression model unchanged, so that the regression residuals become approximately normal with constant variance. The regression coefficients will change because the response variable has changed; therefore they must be interpreted on the transformed scale.

The standard Box--Cox transformation is: \[ Y_i^{(\lambda)} = \begin{cases} \dfrac{Y_i^{\lambda} - 1}{\lambda}, & \lambda \neq 0,\\[8pt] \log(Y_i), & \lambda = 0. \end{cases} \]

The inverse transformation is: \[ Y_i = \begin{cases} \left(1 + \lambda Y_i^{(\lambda)}\right)^{1/\lambda}, & \lambda \neq 0,\\[10pt] \exp\!\left(Y_i^{(\lambda)}\right), & \lambda = 0. \end{cases} \] Because we model the transformed variable \(Z = g(Y)\) as normal but observe \(Y\), the density of \(Y\) follows from the change-of-variable formula: \[ f_Y(y) = f_Z\!\left(g(y)\right) \left| \frac{d g(y)}{dy} \right|. \]
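
A quick numerical round-trip check of the forward and inverse transforms, using SciPy's `scipy.special.boxcox` and `scipy.special.inv_boxcox` (the sample values are illustrative):

```python
import numpy as np
from scipy.special import boxcox, inv_boxcox

y = np.array([0.5, 1.0, 2.0, 4.0])      # illustrative positive values
for lam in (0.0, 0.5, 2.0):
    z = boxcox(y, lam)                   # (y**lam - 1)/lam, log(y) when lam == 0
    y_back = inv_boxcox(z, lam)          # (1 + lam*z)**(1/lam), exp(z) when lam == 0
    assert np.allclose(y_back, y)
print("round-trip recovers the original data for each lambda")
```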

For the Box--Cox transformation: \[ \frac{dz}{dy} = y^{\lambda - 1}, \] and therefore the Jacobian determinant for \(n\) observations is: \[ J(\lambda) = \prod_{i=1}^n y_i^{\lambda - 1}. \] Assuming independent, normally distributed residuals after transformation, the full log-likelihood of the linear model becomes: \[ \log \mathcal{L} = -\frac{n}{2}\log(2\pi) - n\log\sigma -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \left[ Y_i^{(\lambda)} - (\beta_0 + \beta X_i) \right]^2 + (\lambda - 1)\sum_{i=1}^{n} \log Y_i. \]
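
For concreteness, this full log-likelihood can be evaluated directly for fixed parameter values; the sketch below does so in Python/NumPy (the data and parameter values are illustrative).

```python
import numpy as np

def boxcox_loglik(lam, y, x, beta0, beta1, sigma):
    """Full Box-Cox log-likelihood: normal model on the transformed
    scale plus the Jacobian term (lam - 1) * sum(log y)."""
    z = np.log(y) if lam == 0 else (y**lam - 1.0) / lam
    resid = z - (beta0 + beta1 * x)
    n = len(y)
    return (-0.5 * n * np.log(2 * np.pi)
            - n * np.log(sigma)
            - np.sum(resid**2) / (2 * sigma**2)
            + (lam - 1.0) * np.sum(np.log(y)))

# Illustrative values only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.7, 4.1, 6.0, 9.2, 13.5])
print(boxcox_loglik(0.0, y, x, beta0=0.6, beta1=0.4, sigma=0.1))
```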

When maximizing over \(\lambda\), the \(-\frac{n}{2}\log(2\pi)\) term is a constant and can be dropped; the terms involving \(\sigma\) and \(\beta\) are handled by substituting their maximum likelihood estimates for each fixed \(\lambda\), which yields the profile log-likelihood derived in the steps below.


Step 1. Replace \(\sigma^2\) by its MLE

For a fixed \(\lambda\), with \(\hat\beta_0\) and \(\hat\beta\) the OLS estimates on the transformed response, the MLE of \(\sigma^2\) (easily obtained by setting \(\frac{\partial \log \mathcal{L}}{\partial \sigma}=0\)) is

\[ \hat{\sigma}^2_\lambda = \frac{1}{n}\sum_{i=1}^n \left[Y_i^{(\lambda)} - (\hat{\beta}_0 + \hat{\beta} X_i)\right]^2. \]

Plugging this back into the likelihood and removing constants gives

\[ \mathcal{L}(\lambda) \propto -\frac{n}{2}\log\!\left( \sum_{i=1}^n \left[Y_i^{(\lambda)} - (\hat{\beta}_0 + \hat{\beta} X_i)\right]^2 \right) + (\lambda - 1)\sum_{i=1}^n \log Y_i. \]
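
A sketch of this profile log-likelihood in Python/NumPy (the simulated data are an assumption for the example): for each candidate \(\lambda\), an OLS fit supplies \(\hat\beta_0(\lambda)\), \(\hat\beta(\lambda)\) and the residual sum of squares.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(1, 10, size=100)
y = np.exp(0.3 + 0.2 * x + rng.normal(0, 0.1, size=100))    # illustrative positive response

def profile_loglik(lam, y, x):
    """Profile log-likelihood of lambda, up to additive constants."""
    z = np.log(y) if abs(lam) < 1e-12 else (y**lam - 1.0) / lam
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, z, rcond=None)             # beta0-hat, beta-hat for this lambda
    rss = np.sum((z - X @ beta) ** 2)
    n = len(y)
    return -0.5 * n * np.log(rss) + (lam - 1.0) * np.sum(np.log(y))

lams = np.linspace(-1.0, 1.0, 41)
vals = [profile_loglik(lam, y, x) for lam in lams]
print("lambda-hat ~", lams[int(np.argmax(vals))])            # near 0 for this simulated data
```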


Step 2. Express \(\sum \log Y_i\) using the geometric mean

Define the geometric mean of the responses as

\[ \dot{Y} = \exp\!\left(\frac{1}{n}\sum_{i=1}^n \log Y_i\right), \]

so that

\[ \sum_{i=1}^n \log Y_i = n \log \dot{Y}. \]

Then the second term becomes

\[ (\lambda - 1)\sum_{i=1}^n \log Y_i = n(\lambda - 1)\log \dot{Y}. \]
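
A quick numerical check of this identity (sample values illustrative):

```python
import numpy as np

y = np.array([0.5, 1.0, 2.0, 4.0])
gmean = np.exp(np.mean(np.log(y)))    # geometric mean Y-dot
lhs = np.sum(np.log(y))
rhs = len(y) * np.log(gmean)
print(np.isclose(lhs, rhs))           # True: sum(log Y_i) == n * log(Y-dot)
```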


Step 3. Combine terms inside the log

We can absorb this term into the residual sum of squares by dividing the sum of squares by \(\dot{Y}^{2(\lambda - 1)}\), giving:

\[ \mathcal{L}(\lambda) \propto -\frac{n}{2}\log\!\left[ \frac{1}{\dot{Y}^{2(\lambda - 1)}} \sum_{i=1}^n \left(Y_i^{(\lambda)} - (\hat{\beta}_0 + \hat{\beta} X_i)\right)^2 \right]. \]

After rearranging, this can also be written as:

\[ \mathcal{L}(\lambda) \propto -\frac{n}{2}\log\!\left[ \sum_{i=1}^n \left(\frac{Y_i^{(\lambda)} - (\hat{\beta}_0 + \hat{\beta} X_i)}{\dot{Y}^{\lambda-1}}\right)^2 \right]. \]
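
The following sketch (illustrative simulated data, NumPy assumed) verifies numerically that this scaled form agrees with the Step 1 expression for several values of \(\lambda\):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(1, 10, size=50)
y = rng.lognormal(0.2 + 0.1 * x, 0.1)          # illustrative positive response

def fit_rss(z, x):
    """Residual sum of squares from an OLS fit of z on an intercept and x."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, z, rcond=None)
    return np.sum((z - X @ beta) ** 2)

n = len(y)
gmean = np.exp(np.mean(np.log(y)))             # geometric mean Y-dot
for lam in (-0.5, 0.5, 1.5):
    z = (y**lam - 1.0) / lam
    step1 = -0.5 * n * np.log(fit_rss(z, x)) + (lam - 1.0) * np.sum(np.log(y))
    step3 = -0.5 * n * np.log(fit_rss(z / gmean**(lam - 1.0), x))
    assert np.isclose(step1, step3)
print("the two forms agree for each lambda")
```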


Step 4. Interpretation

  • \(\dot{Y}\) is the geometric mean of the responses.
  • The scaling by \(\dot{Y}^{\lambda-1}\) puts the transformed residuals back on the scale of the original response, so that residual sums of squares (and hence likelihood values) are comparable across different values of \(\lambda\).
  • The proportionality symbol \(\propto\) indicates that additive constants (such as the \(2\pi\) term) are omitted because they do not affect the optimal \(\lambda\).

Thus, the expression

\[ \mathcal{L}(\lambda) \propto -\frac{n}{2}\log\!\left[ \sum_{i=1}^n \left(\frac{Y_i^{(\lambda)} - (\hat{\beta}_0 + \hat{\beta} X_i)}{\dot{Y}^{\lambda-1}}\right)^2 \right] \]

is mathematically equivalent to the standard Box--Cox log-likelihood, up to constants.

How to obtain \(\beta_0\) and \(\beta\) for a given \(\lambda\)?

For a given \(\lambda\), compute the transformed responses \(Z_i(\lambda) \equiv Y_i^{(\lambda)}\) and apply OLS. With one regressor the estimators are \[ \hat\beta(\lambda)= \frac{\sum_{i=1}^n (X_i-\bar X)\,\big(Z_i(\lambda)-\bar Z(\lambda)\big)} {\sum_{i=1}^n (X_i-\bar X)^2}\\ \hat\beta_0(\lambda)=\bar Z(\lambda)-\hat\beta(\lambda)\,\bar X \] where \(\bar X=\frac1n\sum_i X_i\) and \(\bar Z(\lambda)=\frac1n\sum_i Z_i(\lambda)\).

Matrix form: \[ \hat{\boldsymbol\beta}(\lambda) = (X^\top X)^{-1} X^\top \mathbf{Z}(\lambda). \]
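
A small sketch (illustrative data, NumPy assumed) confirming that the single-regressor formulas and the matrix form give the same estimates for a given \(\lambda\):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(1, 10, size=60)
y = rng.lognormal(0.5 + 0.15 * x, 0.2)   # illustrative positive response

lam = 0.5
z = (y**lam - 1.0) / lam                 # Z_i(lambda)

# Closed-form OLS with a single regressor
beta1 = np.sum((x - x.mean()) * (z - z.mean())) / np.sum((x - x.mean())**2)
beta0 = z.mean() - beta1 * x.mean()

# Matrix form: (X^T X)^{-1} X^T Z
X = np.column_stack([np.ones_like(x), x])
beta_mat = np.linalg.solve(X.T @ X, X.T @ z)

print(np.allclose([beta0, beta1], beta_mat))   # True
```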

Is \(\sigma\) different for each given \(\lambda\)?

Yes. For each \(\lambda\) you fit a different regression model on \(Z_i(\lambda)\).

The MLE of \(\sigma^2\) is \[ \hat\sigma^2(\lambda) = \frac{1}{n} \sum_{i=1}^n \big(Z_i(\lambda)-\hat\beta_0(\lambda)-\hat\beta(\lambda)X_i\big)^2 = \frac{\mathrm{RSS}(\lambda)}{n}. \] Thus \(\hat\sigma^2\) changes with \(\lambda\).
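
To make this concrete, the sketch below (illustrative data) computes \(\hat\sigma^2(\lambda)=\mathrm{RSS}(\lambda)/n\) at a few values of \(\lambda\); the estimates differ because each \(\lambda\) defines a different transformed response.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(1, 10, size=80)
y = rng.lognormal(0.3 + 0.2 * x, 0.15)      # illustrative positive response
X = np.column_stack([np.ones_like(x), x])
n = len(y)

for lam in (0.0, 0.5, 1.0):
    z = np.log(y) if lam == 0 else (y**lam - 1.0) / lam
    beta, *_ = np.linalg.lstsq(X, z, rcond=None)
    sigma2 = np.sum((z - X @ beta) ** 2) / n    # MLE of sigma^2 for this lambda
    print(f"lambda = {lam:4.1f}  sigma2-hat = {sigma2:.4f}")
```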

Are \(Y_i\) iid normal? If not, why?

No.

Assumption: the errors on the transformed scale are iid normal:

\[ Z_i(\lambda)\mid X_i \sim \mathcal{N}(\beta_0+\beta X_i,\sigma^2). \]

Because the mean depends on \(X_i\), the responses are independent but not identically distributed.
On the original scale, the \(Y_i\) are generally not normal; the transformation is used to approximate normality and stabilize variance.
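
As an illustration (simulated data, not from the original text): when the errors are normal on the log scale, residuals from an OLS fit to \(\log Y\) pass a normality check that residuals from a fit to the raw \(Y\) tend to fail.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.uniform(1, 10, size=200)
y = np.exp(0.3 + 0.2 * x + rng.normal(0, 0.3, size=200))   # normal errors on the log scale
X = np.column_stack([np.ones_like(x), x])

for label, resp in [("raw Y", y), ("log Y (lambda = 0)", np.log(y))]:
    beta, *_ = np.linalg.lstsq(X, resp, rcond=None)
    resid = resp - X @ beta
    stat, p = stats.shapiro(resid)                         # Shapiro-Wilk normality test
    print(f"{label:20s} Shapiro-Wilk p-value = {p:.3g}")
```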

Reference

Box, G. E. P., & Cox, D. R. (1964). An analysis of transformations. Journal of the Royal Statistical Society: Series B (Methodological), 26(2), 211–252.