
What is the goal of Box-Cox transformation?

Many statistical tests and intervals are based on the assumption of normality. Unfortunately, many real data sets are in fact not approximately normal. However, an appropriate transformation of a data set can often yield a data set that does follow approximately a normal distribution. The Box-Cox transformation is a particularly useful family of transformations. It is defined as \[ T(Y)= \begin{cases} \dfrac{Y^{\lambda}-1}{\lambda}, & \lambda \neq 0,\\[8pt] \log(Y), & \lambda = 0, \end{cases} \] where \(Y\) is the response variable and \(\lambda\) is the transformation parameter.

\(Y\) is usually required to be positive. More generally, when \(Y>-c\) for some positive constant \(c\), the transformation is applied to the shifted variable \(Y+c\): \[ T(Y)= \begin{cases} \dfrac{(Y+c)^{\lambda}-1}{\lambda}, & \lambda \neq 0,\\[8pt] \log(Y+c), & \lambda = 0. \end{cases} \] The \((Y^{\lambda}-1)/\lambda\) form is slightly preferable for theoretical analysis because it is continuous at \(\lambda=0\): as \(\lambda\to 0\), \((Y^{\lambda}-1)/\lambda\to \log(Y)\).

Linear models may not meet the assumptions of Ordinary Least Squares (OLS): the residuals may not be normally distributed, or they may violate the homoscedasticity assumption that residuals have equal variance at every fitted value. The Box-Cox transformation is used to stabilize the error variance and bring the residuals closer to a normal distribution.
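As a quick illustration, the boxcox function in the R package MASS performs the profile-likelihood search for \(\lambda\) derived in the sections below. A minimal sketch; the model and variable names are made-up assumptions for illustration, not from the text:

```r
library(MASS)

set.seed(1)
x <- runif(100, 1, 10)
y <- exp(0.5 + 0.3 * x + rnorm(100, sd = 0.3))  # positive, right-skewed response

fit <- lm(y ~ x)
bc  <- boxcox(fit, lambda = seq(-2, 2, by = 0.05), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]   # lambda maximizing the profile log-likelihood

# Refit on the transformed scale
z <- if (abs(lambda_hat) > 1e-8) (y^lambda_hat - 1) / lambda_hat else log(y)
fit_bc <- lm(z ~ x)
```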

What is the goal of using MLE in Box-Cox transformation?

The goal of maximum likelihood estimation in the Box--Cox context is to find the transformation parameter \(\lambda\) that maximizes the likelihood under the assumption of normal residuals: \[ \hat{\lambda} = \arg\max_{\lambda} \log \mathcal{L}\!\left( \lambda \mid Y,\; X,\; \hat{\beta}(\lambda),\; \hat{\sigma}^2(\lambda) \right). \]

How is Box-Cox transformation derived?

Box and Cox (1964) developed a family of transformations designed to reduce nonnormality of the errors in a linear model and to stabilize heteroscedasticity. The linear model before transformation is: \[ Y_i = \beta_0 + \beta X_i + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2). \]

The idea is to transform the response variable \(Y_i\) to a replacement variable \(Y_i^{(\lambda)}\), leaving the right-hand side of the regression model unchanged, so that regression residuals become approximately normal and have constant variance. The regression coefficients will change because the response variable has changed; therefore they must be interpreted with respect to the transformed scale.

The standard Box--Cox transformation is: \[ Y_i^{(\lambda)} = \begin{cases} \dfrac{Y_i^{\lambda} - 1}{\lambda}, & \lambda \neq 0,\\[8pt] \log(Y_i), & \lambda = 0. \end{cases} \]

The inverse transformation is: \[ Y_i = \begin{cases} \exp\!\left(\dfrac{\log(1+\lambda Y_i^{(\lambda)})}{\lambda}\right), & \lambda \neq 0,\\[10pt] \exp\!\left(Y_i^{(\lambda)}\right), & \lambda = 0. \end{cases} \] For a transformation \(Z = g(Y)\), the density of \(Z\) is given by the change-of-variable formula: \[ f_Z(z) = f_Y\!\left(g^{-1}(z)\right) \left| \frac{d g^{-1}(z)}{dz} \right|. \]
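As a quick check that the two formulas are mutual inverses for \(\lambda \neq 0\), substitute the forward transformation into the inverse: \[ \exp\!\left(\frac{\log\!\big(1+\lambda Y_i^{(\lambda)}\big)}{\lambda}\right) = \left(1+\lambda\cdot\frac{Y_i^{\lambda}-1}{\lambda}\right)^{1/\lambda} = \left(Y_i^{\lambda}\right)^{1/\lambda} = Y_i. \]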

For the Box--Cox transformation: \[ \frac{dz}{dy} = y^{\lambda - 1}, \] and therefore the Jacobian determinant for \(n\) observations is: \[ J(\lambda) = \prod_{i=1}^n y_i^{\lambda - 1}. \] Assuming independent, normally distributed residuals after transformation, the log-likelihood of the linear model becomes: \[ \log \mathcal{L} = -\frac{n}{2}\log(2\pi) - n\log\sigma -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \left[ Y_i^{(\lambda)} - (\beta_0 + \beta X_i) \right]^2 + (\lambda - 1)\sum_{i=1}^{n} \log Y_i. \]

We now maximize this log-likelihood over \(\lambda\) step by step. The term \(-\frac{n}{2}\log(2\pi)\) does not depend on \(\lambda\) or \(\sigma\), so it can be omitted when forming the profile log-likelihood.


Step 1. Replace \(\sigma^2\) by its MLE

For a fixed \(\lambda\), the MLE for \(\sigma^2\) (easily obtained by setting \(\frac{\partial \log \mathcal{L}}{\partial \sigma}=0\)) is

\[ \hat{\sigma}^2_\lambda = \frac{1}{n}\sum_{i=1}^n \left[Y_i^{(\lambda)} - (\hat{\beta}_0 + \hat{\beta} X_i)\right]^2. \]

Plugging this back into the likelihood and removing constants gives

\[ \mathcal{L}(\lambda) \propto -\frac{n}{2}\log\!\left( \sum_{i=1}^n \left[Y_i^{(\lambda)} - (\hat{\beta}_0 + \hat{\beta} X_i)\right]^2 \right) + (\lambda - 1)\sum_{i=1}^n \log Y_i. \]


Step 2. Express \(\sum \log Y_i\) using the geometric mean

Define the geometric mean of the responses as

\[ \dot{Y} = \exp\!\left(\frac{1}{n}\sum_{i=1}^n \log Y_i\right), \]

so that

\[ \sum_{i=1}^n \log Y_i = n \log \dot{Y}. \]

Then the second term becomes

\[ (\lambda - 1)\sum_{i=1}^n \log Y_i = n(\lambda - 1)\log \dot{Y}. \]


Step 3. Combine terms inside the log

We can absorb this term into the logarithm by dividing the residual sum of squares by \(\dot{Y}^{2(\lambda - 1)}\), giving:

\[ \mathcal{L}(\lambda) \propto -\frac{n}{2}\log\!\left[ \frac{1}{\dot{Y}^{2(\lambda - 1)}} \sum_{i=1}^n \left(Y_i^{(\lambda)} - (\hat{\beta}_0 + \hat{\beta} X_i)\right)^2 \right]. \]

After rearranging, this can also be written as:

\[ \mathcal{L}(\lambda) \propto -\frac{n}{2}\log\!\left[ \sum_{i=1}^n \left(\frac{Y_i^{(\lambda)} - (\hat{\beta}_0 + \hat{\beta} X_i)}{\dot{Y}^{\lambda-1}}\right)^2 \right]. \]


Step 4. Interpretation

  • \(\dot{Y}\) is the geometric mean of the responses.
  • The scaling by \(\dot{Y}^{\lambda-1}\) puts the transformed responses on the same scale as the original data for every \(\lambda\), so residual sums of squares are comparable across different values of \(\lambda\).
  • The proportionality symbol \(\propto\) indicates that constants (such as \(2\pi\)) are omitted because they do not affect the optimal \(\lambda\).

Thus, the expression

\[ \mathcal{L}(\lambda) \propto -\frac{n}{2}\log\!\left[ \sum_{i=1}^n \left(\frac{Y_i^{(\lambda)} - (\hat{\beta}_0 + \hat{\beta} X_i)}{\dot{Y}^{\lambda-1}}\right)^2 \right] \]

is mathematically equivalent to the standard Box--Cox log-likelihood, up to constants.
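A minimal R sketch of this profile-likelihood search, assuming a single regressor; the data and variable names (y, x, lambda_grid) are illustrative:

```r
# Profile log-likelihood of lambda, up to additive constants:
# -n/2 * log(RSS(lambda)/n) + (lambda - 1) * sum(log(y))
boxcox_profile <- function(y, x, lambda_grid = seq(-2, 2, by = 0.01)) {
  n <- length(y)
  log_gm <- mean(log(y))                       # log of the geometric mean of y
  sapply(lambda_grid, function(lam) {
    z <- if (abs(lam) > 1e-8) (y^lam - 1) / lam else log(y)  # Y^(lambda)
    rss <- sum(resid(lm(z ~ x))^2)             # OLS fit for this lambda
    -n / 2 * log(rss / n) + (lam - 1) * n * log_gm
  })
}

set.seed(1)
x <- runif(200, 1, 10)
y <- (0.4 + 0.1 * x + rnorm(200, sd = 0.05))^2   # true lambda is about 0.5
lambda_grid <- seq(-2, 2, by = 0.01)
ll <- boxcox_profile(y, x, lambda_grid)
lambda_grid[which.max(ll)]                       # lambda-hat, should be near 0.5
```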

How to obtain \(\beta_0\) and \(\beta\) for a given \(\lambda\)?

For a given \(\lambda\), compute the transformed responses \(Z_i(\lambda)\equiv Y_i^{(\lambda)}\) and apply OLS. The estimators with one regressor are \[ \hat\beta(\lambda)= \frac{\sum_{i=1}^n (X_i-\bar X)\,\big(Z_i(\lambda)-\bar Z(\lambda)\big)} {\sum_{i=1}^n (X_i-\bar X)^2}\\ \hat\beta_0(\lambda)=\bar Z(\lambda)-\hat\beta(\lambda)\,\bar X \] where \(\bar X=\frac1n\sum_i X_i\) and \(\bar Z(\lambda)=\frac1n\sum_i Z_i(\lambda)\).

Matrix form: \[ \hat{\boldsymbol\beta}(\lambda) = (X^\top X)^{-1} X^\top \mathbf{Z}(\lambda). \]

Is \(\sigma\) different for each given \(\lambda\)?

Yes. For each \(\lambda\) you fit a different regression model on \(Z_i(\lambda)\).

The MLE of \(\sigma^2\) is \[ \hat\sigma^2(\lambda) = \frac{1}{n} \sum_{i=1}^n \big(Z_i(\lambda)-\hat\beta_0(\lambda)-\hat\beta(\lambda)X_i\big)^2 = \frac{\mathrm{RSS}(\lambda)}{n}. \] Thus \(\hat\sigma^2\) changes with \(\lambda\).

Are \(Y_i\) iid normal? If not, why?

No.

Assumption: the errors on the transformed scale are iid normal:

\[ Z_i(\lambda)\mid X_i \sim \mathcal{N}(\beta_0+\beta X_i,\sigma^2). \]

Because the mean depends on \(X_i\), the responses are independent but not identically distributed.
On the original scale, the \(Y_i\) are generally not normal; the transformation is used to approximate normality and stabilize variance.

Reference

Multicollinearity arises in polynomial regression. As a special case of the linear model, polynomial regression suffers from two problems: the first is multicollinearity among the powers of \(x\); the second is that if the power \(k\) is large, say more than \(3\) or \(4\), the magnitudes of these powers tend to vary over a rather wide range.

To deal with these problems, two strategies are used. One is to limit the degree to a cubic (\(k=3\)) if at all possible, and generally to no more than a quintic (\(k=5\)). The other is to center the \(x\)-variable; in other words, fit the model \[ y=\beta_0^*+\beta_1^*(x-\bar{x})+\cdots+\beta_k^*(x-\bar{x})^k \] In addition to centering, one can also scale the \(x\)-variable by dividing by its standard deviation \(s_x\), thus standardizing it to have zero mean and unit standard deviation. The sketch below illustrates the effect.
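A brief illustration (with assumed data) of how centering and scaling the \(x\)-variable reduces the correlation, and hence the multicollinearity, among polynomial terms:

```r
set.seed(1)
x <- seq(10, 20, length.out = 50)
y <- 2 + 0.5 * x + 0.1 * x^2 + rnorm(50)

cor(x, x^2)                    # raw powers are nearly collinear (close to 1)

xs <- (x - mean(x)) / sd(x)    # center and scale
cor(xs, xs^2)                  # correlation is greatly reduced

fit <- lm(y ~ xs + I(xs^2))    # quadratic fit on the standardized scale
```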

Reference

  • Statistics and Data Analysis: From Elementary to Intermediate (page 418)

When approximating the posterior distribution, one tends to ignore the normalizing constant in the equation \[ \pi(\theta|x)=\frac{\pi(x|\theta)\pi(\theta)}{\int\pi(x|\theta)\pi(\theta)d\theta}\\ =\frac{\pi(x|\theta)\pi(\theta)}{\pi(x)}\\ \propto\pi(x|\theta)\pi(\theta) \] In fact, not all MCMC methods avoid the need for the normalizing constant. However, many of them do (such as the Metropolis-Hastings algorithm), since the iteration process is based on the ratio \[ \begin{gather*} R(\theta_1,\theta_2)=\frac{\pi(\theta_1|x)}{\pi(\theta_2|x)}\\ =\frac{\pi(x|\theta_1)\pi(\theta_1)}{\pi(x)}/\frac{\pi(x|\theta_2)\pi(\theta_2)}{\pi(x)}\\ =\frac{\pi(x|\theta_1)\pi(\theta_1)}{\pi(x|\theta_2)\pi(\theta_2)} \end{gather*} \] Observe that \(R(\theta_1,\theta_2)\) does not involve the normalizing constant: \(\pi(x)\) cancels out when approximating the posterior distribution \(\pi(\theta|x)\). Therefore, one can state that \(\pi(\theta|x)\) is proportional to \(\pi(x|\theta)\pi(\theta)\) and use this fact to approximate the posterior distribution.
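A minimal Metropolis-Hastings sketch in R, with an assumed normal-mean model, showing that only the unnormalized posterior \(\pi(x|\theta)\pi(\theta)\) is ever evaluated:

```r
set.seed(1)
x <- rnorm(50, mean = 2, sd = 1)                 # observed data

log_unnorm_post <- function(theta) {
  sum(dnorm(x, mean = theta, sd = 1, log = TRUE)) +   # log-likelihood
    dnorm(theta, mean = 0, sd = 10, log = TRUE)       # log-prior
}

n_iter <- 5000
theta  <- numeric(n_iter)                        # chain, started at 0
for (t in 2:n_iter) {
  prop <- theta[t - 1] + rnorm(1, sd = 0.5)      # symmetric random-walk proposal
  logR <- log_unnorm_post(prop) - log_unnorm_post(theta[t - 1])  # pi(x) cancels
  theta[t] <- if (log(runif(1)) < logR) prop else theta[t - 1]
}
mean(theta[-(1:1000)])                           # posterior mean, close to 2
```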

Reference

  • Normalizing constant irrelevant in Bayes Theorem?

  • Rationale behind ignoring the “denominator” in Bayes Rule

Note: This section contains notes on the derivations made while authoring my book. The derivation details originate from public Q&A platforms and will naturally differ from those in the book chapters.

Consider the following \(2\times 2\) contingency table:

|  | Column 1 | Column 2 | Total |
| --- | --- | --- | --- |
| Row 1 | \(a\) | \(b\) | \(n_{1\cdot}=a+b\) |
| Row 2 | \(c\) | \(d\) | \(n_{2\cdot}=c+d\) |
| Total | \(n_{\cdot 1}\) | \(n_{\cdot 2}\) | \(N\) |

The goal is to approximate the variance of log odds-ratio \(VAR[ln(OR)]\) when \(OR\) is expressed as \[ OR=\frac{p_1/(1-p_1)}{p_2/(1-p_2)} \] where \(p_1\) and \(p_2\) are two probabilities for group 1 and group 2, respectively.

Suppose \(Y=f(X)\); then we can use the delta method to approximate \(f(X)\) around \(E(X)=\mu\) via a second-order Taylor expansion: \[ f(X)\approx f(\mu)+f'(\mu)(X-\mu)+\frac{1}{2}f''(\mu)(X-\mu)^2 \] Assuming \(VAR(X)=\sigma^2\), we have \[ \begin{gather*} E(Y)=E[f(X)]\approx f(\mu)+\frac{1}{2}f''(\mu)\sigma^2\\ VAR(Y)=VAR[f(X)]\approx [f'(\mu)]^2VAR(X) \end{gather*} \]

Note: if \(E[Y]=E[f(X)]\) can be obtained exactly using the transformed density function, then the delta method is not needed to approximate \(Y=f(X)\).

Let \(Y=ln(\hat{OR})\); since group 1 and group 2 are independent, we have \[ \begin{gather*} VAR(ln(\hat{OR}))=VAR[ln(\frac{\hat{p}_1/(1-\hat{p}_1)}{\hat{p}_2/(1-\hat{p}_2)})]\\ =VAR[ln(\frac{\hat{p}_1}{1-\hat{p}_1})]+VAR[ln(\frac{\hat{p}_2}{1-\hat{p}_2})]\\ =(\frac{1}{\hat{p}_1(1-\hat{p}_1)})^2\frac{\hat{p}_1(1-\hat{p}_1)}{n_{1\cdot}}+(\frac{1}{\hat{p}_2(1-\hat{p}_2)})^2\frac{\hat{p}_2(1-\hat{p}_2)}{n_{2\cdot}}\\ =\frac{1}{n_{1\cdot}\hat{p}_1(1-\hat{p}_1)}+\frac{1}{n_{2\cdot}\hat{p}_2(1-\hat{p}_2)}\\ =\frac{1}{n_{1\cdot}\hat{p}_1}+\frac{1}{n_{1\cdot}(1-\hat{p}_1)}+\frac{1}{n_{2\cdot}\hat{p}_2}+\frac{1}{n_{2\cdot}(1-\hat{p}_2)}\\ =\frac{1}{a}+\frac{1}{b}+\frac{1}{c}+\frac{1}{d} \end{gather*} \] where \(n_{1\cdot},n_{2\cdot}\) are the counts for group 1 and group 2, respectively; \(a,b,c,d\) are the 4 counts of the \(2\times 2\) contingency table.
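As an illustrative numeric check (the \(2\times 2\) counts below are made up), the delta-method standard error of \(ln(\hat{OR})\) gives the usual Wald confidence interval for the odds ratio:

```r
n11 <- 20; n12 <- 80                     # a, b: group 1 counts
n21 <- 10; n22 <- 90                     # c, d: group 2 counts

or     <- (n11 / n12) / (n21 / n22)      # estimated odds ratio
se_log <- sqrt(1/n11 + 1/n12 + 1/n21 + 1/n22)   # SE of log(OR), from above
exp(log(or) + c(-1, 1) * 1.96 * se_log)  # 95% Wald CI for the odds ratio
```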

Reference

  • How to calculate the standard deviation of the log-odds?

Note: This section contains notes on the derivations made while authoring my book. The derivation details originate from public Q&A platforms and will naturally differ from those in the book chapters.

The goal is to prove that the marginal distribution of the data \(n\) is a negative binomial distribution when the likelihood is a Poisson distribution.

Suppose the count \(n\) is drawn from a Poisson distribution \[ n\sim Poisson(\mu) \] where \(\lambda=\mu/E\sim \Gamma(\alpha,\beta)\), with \(\beta\) the rate parameter and \(E\) the known exposure.

Re-express the distribution of \(n\) as \[ \begin{gather*} n|\lambda\sim Poisson(\lambda E)\\ f(n|\lambda,E)=\frac{(\lambda E)^n}{n!}e^{-\lambda E} \end{gather*} \] Suppose \(\alpha,\beta\) are known. Integrating over all possible values of \(\lambda\), we obtain the marginal distribution of \(n\) as \[ \begin{gather*} f(n|\alpha,\beta)=\int_0^{\infty}f(n|\lambda)\pi(\lambda)d\lambda\\ =\int_0^{\infty}\frac{(\lambda E)^n}{n!}e^{-\lambda E}\frac{(\beta \lambda)^{\alpha -1}e^{-\lambda \beta}}{\Gamma(\alpha)}\beta d\lambda\\ =\frac{E^n\beta^{\alpha}}{n!\Gamma(\alpha)}\int_0^{\infty}\lambda^ne^{-\lambda E}\lambda^{\alpha-1}e^{-\lambda \beta}d\lambda\\ =\frac{E^n\beta^{\alpha}}{n!\Gamma(\alpha)}\int_0^{\infty}e^{-\lambda(E+\beta)}\lambda^{n+\alpha-1}d\lambda\\ =\frac{E^n\beta^{\alpha}\Gamma(\alpha+n)}{n!\Gamma(\alpha)(\beta+E)^{n+\alpha}}\int_0^{\infty}\frac{(\beta+E)^{n+\alpha}}{\Gamma(\alpha+n)}e^{-\lambda(E+\beta)}\lambda^{n+\alpha-1}d\lambda\\ =\frac{1}{n!\Gamma(\alpha)}(\frac{E}{\beta+E})^n(\frac{\beta}{\beta+E})^{\alpha}\Gamma(\alpha+n)*1\\ =\frac{\Gamma(\alpha+n)}{n!\Gamma(\alpha)}(\frac{\beta}{E}+1)^{-n}(\frac{E}{\beta}+1)^{-\alpha} \end{gather*} \] \(f(n|\alpha,\beta)\) is exactly the density function of the negative binomial distribution, and is also the likelihood function taking \(\alpha,\beta\) as parameters.

Based on the derivation, it can also be stated that the negative binomial distribution is essentially a Gamma-Poisson mixture distribution: a Poisson distribution whose rate is itself Gamma-distributed, with the rate integrated out.

It is also straightforward to show that the posterior predictive distribution \(f(\hat{n}|n)\) of new data \(\hat{n}\) is a negative binomial distribution. Since the prior of \(\lambda\) is a Gamma distribution, the posterior of \(\lambda\) given \(n\) is also a Gamma distribution. The likelihood of \(\hat{n}\) given the unknown but fixed \(\lambda\) is a Poisson distribution. We can then obtain the posterior predictive distribution of \(\hat{n}\) given \(n\) as \[ \begin{gather*} p(\hat{n}|n)=\int_0^{\infty}p(\hat{n}|\lambda)p(\lambda|n)d\lambda\\ =\int_0^{\infty}(Poisson\times Gamma) d\lambda \end{gather*} \] Thus \(\hat{n}|n\) is again a Gamma-Poisson mixture distribution.
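A quick simulation sketch (parameter values assumed for illustration) checking that the Gamma-mixed Poisson matches R's negative binomial; in rnbinom's size/prob parametrization, size \(=\alpha\) and prob \(=\beta/(\beta+E)\):

```r
set.seed(1)
alpha <- 3; beta <- 2; E <- 1.5
lambda <- rgamma(1e5, shape = alpha, rate = beta)  # lambda ~ Gamma(alpha, beta)
n_mix  <- rpois(1e5, lambda * E)                   # n | lambda ~ Poisson(lambda * E)
n_nb   <- rnbinom(1e5, size = alpha, prob = beta / (beta + E))

c(mean(n_mix), mean(n_nb))   # means agree
c(var(n_mix),  var(n_nb))    # variances agree
```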

Note: This section contains notes on the derivations made while authoring my book. The derivation details originate from public Q&A platforms and will naturally differ from those in the book chapters.

The goal is to calculate the expected value of the logarithm of a Gamma-distributed random variable, \[ E[lnX] \] where \(X\sim Gamma(\alpha,\beta)\).

Part I

To deal with the problem, we need to first figure out how to calculate the density of the transformed variable, which is the density \(f_Y(y)\) when \(Y=g(X)\).

Suppose \(Y=g(X)\), \(X\) is a random variable on probability space \((\Omega,\mathcal{F},\mathcal{P})\), and has continuous density \(f\).

\(\Rightarrow Y=g(X)\) has the distribution function \(F(g^{-1}(y))\) (assuming \(g\) is monotonically increasing).

Taking derivatives via the chain rule, we have \[ \begin{gather*} f_Y(y)=\frac{dF(g^{-1}(y))}{dy}\\=\frac{dF(g^{-1}(y))}{dx}\frac{dx}{dy}\\ =\frac{dF(g^{-1}(y))}{dx}\frac{d[g^{-1}(y)]}{dy}\\ =f(g^{-1}(y))\frac{d[g^{-1}(y)]}{dy} \end{gather*} \] Recall that the derivative of an inverse function is \[ \begin{gather*} \frac{d}{dy}[g^{-1}(y)]\\ =\frac{dx}{d[g(g^{-1}(y))]}\\ =\frac{1}{g'(g^{-1}(y))} \end{gather*} \] The density function of \(g(X)\) is then expressed as \[ \begin{gather*} f_Y(y)=f(g^{-1}(y))\frac{1}{g'(g^{-1}(y))}\\ =\frac{f(g^{-1}(y))}{g'(g^{-1}(y))} \end{gather*} \]

Part II

Now that we have figured out how to calculate the density of the transformed variable, we can proceed to obtain the density of \(Y=lnX\).

Since \(X\sim \Gamma(\alpha,\beta)\) and \(Y=g(X)=ln(X)\), we have \(X=e^y\). Because \(\beta\) is a rate parameter, its only effect is to shift the logarithm by \(-ln\beta\). So we can start with the case \(\beta=1\).

That gives us the distribution of \(X\) as \[ f_X(x)=\frac{1}{\Gamma(\alpha)}x^{\alpha}e^{-x}\frac{1}{x} \] Where \(\Gamma(\alpha)=\int_0^{\infty}x^{\alpha}e^{-x}\frac{dx}{x}\) is the normalizing constant.

Substituting \(x=e^y\) entails \(dy=\frac{dx}{x}\) as \(\frac{de^y}{dy}=e^y=x\), thus we have \[ \begin{gather*} f_Y(y)=\frac{f(g^{-1}(y))}{g'(g^{-1}(y))}\\ =\frac{1}{\Gamma(\alpha)}e^{\alpha y}e^{-e^y}\frac{1}{e^y}/(\frac{1}{e^y})\\ =\frac{1}{\Gamma(\alpha)}e^{\alpha y-e^y} \end{gather*} \] The possible values of \(Y\) now range over all the real numbers \(\mathbb{R}\).

Part III

Since \(f_Y\) must integrate to unity, \[ \begin{gather*} \int_{\mathbb{R}}f_Y(y)dy=\int_{\mathbb{R}}\frac{1}{\Gamma(\alpha)}e^{\alpha y-e^y}dy\\ =\frac{1}{\Gamma(\alpha)}\int_{\mathbb{R}}e^{\alpha y-e^y}dy\\ =1 \end{gather*} \] we obtain \[ \Gamma(\alpha)=\int_{\mathbb{R}}e^{\alpha y-e^y}dy \] To calculate \(E(Y)=\int_{\mathbb{R}}yf_Y(y)dy\), we need a way to bring down the factor \(y\); differentiating with respect to \(\alpha\) does exactly this.

To do so, we notice that \(f_Y(y)\) is a differentiable function of \(\alpha\), with \[ \frac{d}{d\alpha}e^{\alpha y-e^y}=ye^{\alpha y-e^y}\\ =y\Gamma(\alpha)f_Y(y)\\ \] since \(e^{\alpha y-e^y}=\Gamma(\alpha)f_Y(y)\) follows from \[ f_Y(y)=\frac{1}{\Gamma(\alpha)}e^{\alpha y-e^y} \] Now we can calculate \(E(Y)\) (interchanging differentiation and integration): \[ \begin{gather*} E(Y)=\int_{\mathbb{R}}yf_Y(y)dy\\ =\frac{1}{\Gamma(\alpha)}\int_{\mathbb{R}}\Gamma(\alpha)yf_Y(y)dy\\ =\frac{1}{\Gamma(\alpha)}\int_{\mathbb{R}}\frac{d}{d\alpha}e^{\alpha y-e^y}dy\\ =\frac{1}{\Gamma(\alpha)}\frac{d}{d\alpha}\int_{\mathbb{R}}e^{\alpha y-e^y}dy\\ =\frac{1}{\Gamma(\alpha)}\frac{d}{d\alpha}\Gamma(\alpha)\\ =\frac{d}{d\alpha}ln\Gamma(\alpha)\\ =\Psi(\alpha) \end{gather*} \] \(\Psi(\cdot)\) is the digamma function, the logarithmic derivative of the gamma function.

By restoring the rate parameter \(\beta\), we obtain the general result \[ E[lnX]=\Psi(\alpha)-ln\beta \]
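A numeric sanity check in R (parameter values assumed), comparing a Monte Carlo estimate with the closed form \(\Psi(\alpha)-ln\beta\):

```r
set.seed(1)
alpha <- 2.5; beta <- 3                 # beta is the rate parameter
x <- rgamma(1e6, shape = alpha, rate = beta)

mean(log(x))                            # Monte Carlo estimate of E[ln X]
digamma(alpha) - log(beta)              # closed form; the two nearly agree
```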

Reference

  • How to calculate the density function of \(g(X)\)?

  • What is the expected value of the logarithm of Gamma distribution?

The field of statistics essentially comprises two main branches: frequentist and Bayesian. The primary distinction between these two approaches, as articulated by Michael I. Jordan, lies in their respective viewpoints:

Frequentists typically proceed by moving forward from the state of nature to the data, while Bayesians operate in the opposite direction, moving from the data back to the state of nature. This fundamental contrast is exemplified by the false discovery rate, which considers the probability of a hypothesis given that a discovery has been made, i.e., conditioning on the data.

Contemplating these divergent perspectives gives rise to intriguing questions. Below are some inquiries I have formulated and explored:

  1. What are the distinctions between confidence interval and credible interval? What are the distinctions between coverage probability and credible probability?

    [George Casella, Statistical Inference, p. 435] When describing the interaction between the confidence interval and the parameter, we should carefully say that the interval covers the parameter, not that the parameter is inside the interval. This is done on purpose. It is imperative to stress that the random quantity in the classical frequentist paradigm is the interval, not the parameter. The parameter is fixed. When we say a 90% confidence interval for a parameter \(\lambda\) is \(a\leq \lambda \leq b\), we are tempted to say (and many experimenters do) that “The probability is 90% that \(\lambda\) is in the interval \([a,b]\).” However, within classical statistics, such a statement is invalid since the parameter is assumed fixed. Formally, the interval \([a,b]\) is one of the possible realized values of the random interval \([f(\alpha/2),g(1-\alpha/2)]\). And since the parameter \(\lambda\) does not move, \(\lambda\) is in the realized interval \([a,b]\) with probability either \(0\) or \(1\). When we say that the realized interval \([a,b]\) has a 90% chance of coverage, we only mean that 90% of the sample points of the random interval cover the true parameter.

    In contrast, the Bayesian setup allows us to say that \(\lambda\) is inside \([a,b]\) with some probability other than \(0\) or \(1\). This is because, under the Bayesian model, \(\lambda\) is a random variable with a probability distribution. All Bayesian claims of coverage are made with respect to the posterior distribution of the parameter.

    It is important not to confuse credible probability (the Bayesian posterior probability) with coverage probability (the classical probability). These probabilities are very different entities, with different meanings and interpretations. Credible probability comes from the posterior distribution, which in turn gets its probability from the prior distribution. Thus, credible probabilities reflect the experimenter’s subjective beliefs, as expressed in the prior distribution and updated with the data to the posterior distribution. A Bayesian assertion of 90% coverage means that the experimenter, upon combining prior knowledge with data, is 90% sure of coverage.

  2. How to calculate the credible interval?

    In Bayesian statistics, the posterior distribution of a parameter can be derived and used for inference about that parameter. Since the parameter is considered a random variable in this framework, its distribution can be divided into numerous intervals, each covering 95% of the distribution and thus having a credible probability of 95%. Importantly, while there are multiple ways to construct such 95% credible intervals, they are not considered random in Bayesian analysis. Typically, we choose the interval that is symmetric in probability around the distribution's central point. Once an interval is established, we interpret it as having a 95% probability of containing the true parameter value. (A short computational sketch follows this list.)

  3. In a frequentist hypothesis testing framework, is it possible to compare the probabilities of the null hypothesis (H0) and the alternative hypothesis (H1) to make a decision about accepting or rejecting the hypothesis?

    [George Casella, Statistical Inference, p. 379] The probabilities \(P(H_0 \text{ is true})\) and \(P(H_1 \text{ is true})\) are not meaningful to the classical statistician, who considers \(\theta\) to be a fixed number. Consequently, a hypothesis is either true or false. If \(\theta\in \Theta_0\), then \(P(H_0 \text{ is true})=1\) and \(P(H_1 \text{ is true})=0\) for all values of \(\mathbf{x}\). If \(\theta\in\Theta_0^c\), these values are reversed. Since these probabilities are unknown (because \(\theta\) is unknown) and do not depend on the sample \(\mathbf{x}\), they are not used by the classical statistician. In a Bayesian formulation of a hypothesis testing problem, these probabilities depend on the sample \(\mathbf{x}\) and can give useful information about the veracity of \(H_0\) and \(H_1\).

  4. Is the Likelihood Ratio Test (LRT) specific to the Bayesian or frequentist statistical approach, or is it applicable to both methodologies?

    The likelihood ratio test is a method used within the frequentist statistical framework. It is a hypothesis test that compares two statistical models based on the ratio of their likelihoods. The test evaluates how well two models, typically one being a special case or a subset of the other (referred to as the null and alternative models), explain a set of observed data.

    The likelihood ratio test is not inherently Bayesian; it is primarily a frequentist approach. However, the concept of comparing the likelihoods of different models or hypotheses is also fundamental in Bayesian statistics, though there it is applied in a different context and within a different framework.
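For question 2 above, a minimal sketch (illustrative counts, and an assumed Beta(1, 1) prior) of computing an equal-tailed 95% credible interval for a binomial proportion:

```r
successes <- 12; trials <- 40
alpha_post <- 1 + successes             # Beta posterior parameters under a
beta_post  <- 1 + trials - successes    # Beta(1, 1) prior
qbeta(c(0.025, 0.975), alpha_post, beta_post)  # 95% equal-tailed credible interval
```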

Unsolved questions

  1. What is the evaluation method for a Bayesian test?

  2. What is the Bayesian p-value?

    [What are Bayesian p-values?](https://stats.stackexchange.com/questions/171386/what-are-bayesian-p-values)

  3. Is power analysis necessary for the Bayesian test? If so, how to calculate the power of the Bayesian tests?

    [Is power analysis necessary in Bayesian Statistics?](https://stats.stackexchange.com/questions/65754/is-power-analysis-necessary-in-bayesian-statistics)

    [Power analysis from Bayesian point of view [duplicate]](https://stats.stackexchange.com/questions/110346/power-analysis-from-bayesian-point-of-view)

  4. Does a Bayesian test have a type I error rate when the parameter is assumed to come from a probability distribution?

    Analysis of type I and II error rates of Bayesian and frequentist parametric and nonparametric two-sample hypothesis tests under preliminary assessment of normality

  5. Why is the false discovery rate akin to Bayesian thinking, going back from the data to the state of nature? Why is FDR unlike other metrics, such as recall, precision, or accuracy?

    False discovery rate

    Original paper: Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing

    Bradley Efron empirical Bayes book

IOGSMT

| Issue | Objectives | Goals | Strategies | Measurement | Tactics |
| --- | --- | --- | --- | --- | --- |
| Website lacks users | Increase number of users | Increase users by 10% in one month | Website SEO | Increase users by 5% | Screening out available SEO corporations |
|  |  |  | Decrease subscription fee | Increase users by 5% | Calculate a subscription fee based on the cost |
| *(Problem)* | *(Goal)* | *(Detailed goal)* | *(How to)* | *(Measurement)* | *(Detailed how-to)* |

Remark: IOGSMT is a blueprint to achieve our aspirations. While it focuses on the What and How, we can also incorporate the Why into it.

Framework for learning anything in a short period of time

Build structures

Mark timelines

Always ask what, how and why

Example

The framework of learning a new area:

  • Build structure
  • Mark timeline
  • Find results-oriented data

Several useful databases:

How to train a final machine learning model

This section relates mostly to the following questions:

  • What is the purpose of k-fold cross validation?
  • How to finalize a model?

What is the purpose of k-fold cross validation?

Both train-test splits and k-fold cross validation are examples of resampling methods. Their purpose is to compute a performance measure for the model on unseen test data. They are best viewed as methods to estimate the performance of a statistical procedure, rather than of a particular statistical model.

How to finalize a model?

Once we have obtained the estimated skill of the procedure, we are finished with the resampling method:

  • If we are using a train-test split, we can discard the split datasets and the trained model.
  • If we are using k-fold cross validation, we can throw away all of the trained models.

In other words, once we have selected the best tuned procedure via resampling, we can finalize the model by applying the chosen machine learning procedure to the whole dataset.

This leads to another question:

  • Why not keep the model trained on the training dataset, or the best model from cross validation?

In fact, we could use the sub-models as the final model. However, cross validation and other resampling approaches exist to compare various procedures and to guard against selecting an overfitted one; they are not meant to produce the final model. Furthermore, overfitting has to do with model complexity, not with the amount of data. As such, the model will likely perform better when trained on all of the available data than on just the subset used to estimate the performance of the procedure.

If one really wants to employ the models obtained from cross validation as final models, there is still a way to do so: keep multiple final models and take the mean of an ensemble of their predictions in order to reduce the variance.

It is not advisable to reuse the k models from cross validation as a voting classifier if the true objective is to develop a voting classifier as the final model. In cross validation, each fold serves as the test dataset exactly once, so each prediction comes from a single sub-model; the k models' cross validation results are never aggregated the way a voting classifier would aggregate them, and the CV estimate therefore does not measure the ensemble's performance.

An alternative, when employing ensemble models, is to use an ensemble-type estimate of performance or generalization error, such as an out-of-bag estimate. Using an un-aggregated cross validation estimate for an ensemble model can result in a pessimistic bias that ranges from negligible to substantial, depending on the stability of the CV surrogate models and the number of aggregated surrogate models. Another option is to pick several models that appear to generalize well across various test sets. Each model would then be trained on the whole training data set with its best-tuned hyperparameters to produce a final model. Finally, these final models can be combined to form the ensemble model.

Finalize the model in caret

In caret, the train function will carry out a cross validation experiment (if we choose to) to determine the best tuned hyperparameters, and then use them to train on the entire dataset to obtain the final model.

In this section, we will specify different parameters to see how cross validation works in caret functions.

Use the iris data to apply 10-fold cross validation to the kNN algorithm:

> library(caret)
> inTraining <- createDataPartition(iris$Petal.Width, p = .75, list = FALSE)
> training <- iris[ inTraining,]
> testing <- iris[-inTraining,]
> dim(training)
[1] 114 5
> fitControl <- trainControl(## 10-fold CV
+ method = "cv",
+ number = 10,
+ classProbs = TRUE,
+ savePredictions = TRUE)

The tuning grid length is 10 (tuneLength = 10), which means there are \(10\times 114=1140\) cross validation predictions

> knnFit1 <- train(Species ~ ., data = training, 
+ method = "knn",
+ trControl = fitControl,
+ tuneLength = 10,
+ preProcess = c('center','scale'))

# 10 groups of performance WRT 10 groups of tuned hyperparameters K
> knnFit1$results
k Accuracy Kappa AccuracySD KappaSD
1 5 0.9500000 0.9250000 0.08958064 0.13437096
2 7 0.9583333 0.9375000 0.07081972 0.10622957
3 9 0.9575758 0.9362500 0.07135347 0.10711656
4 11 0.9575758 0.9362500 0.07135347 0.10711656
5 13 0.9659091 0.9487500 0.05903369 0.08867270
6 15 0.9659091 0.9487500 0.05903369 0.08867270
7 17 0.9575758 0.9362500 0.05956599 0.08945242
8 19 0.9325758 0.8987500 0.08626242 0.12942850
9 21 0.9225758 0.8838246 0.08332400 0.12498478
10 23 0.9125758 0.8688993 0.06840424 0.10258360

# A peek of cross validation prediction
> head(knnFit1$pred)
pred obs setosa versicolor virginica rowIndex k Resample
1 setosa setosa 1 0 0 1 5 Fold01
2 setosa setosa 1 0 0 9 5 Fold01
3 setosa setosa 1 0 0 12 5 Fold01
4 setosa setosa 1 0 0 19 5 Fold01
5 versicolor versicolor 0 1 0 40 5 Fold01
6 versicolor versicolor 0 1 0 67 5 Fold01

# Each observation is predicted 10 times
> dim(knnFit1$pred)
[1] 1140 8

From 10-fold cross validation, the best tuned hyperparameter is k = 15

> knnFit1
k-Nearest Neighbors

114 samples
4 predictor
3 classes: 'setosa', 'versicolor', 'virginica'

Pre-processing: centered (4), scaled (4)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 103, 102, 102, 102, 102, 104, ...
Resampling results across tuning parameters:

k Accuracy Kappa
5 0.9500000 0.9250000
7 0.9583333 0.9375000
9 0.9575758 0.9362500
11 0.9575758 0.9362500
13 0.9659091 0.9487500
15 0.9659091 0.9487500
17 0.9575758 0.9362500
19 0.9325758 0.8987500
21 0.9225758 0.8838246
23 0.9125758 0.8688993

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 15.

The final model will use the best tuned hyperparameter(s) and train on the entire training dataset; thus there will be 114 fitted values.

> knnFit1$finalModel
15-nearest neighbor model
Training set outcome distribution:

setosa versicolor virginica
37 39 38
> head(knnFit1$finalModel$learn$X)
Sepal.Length Sepal.Width Petal.Length Petal.Width
X1 -0.9054682 0.95513347 -1.337733 -1.313374
X2 -1.1395348 -0.14029123 -1.337733 -1.313374
X4 -1.4906348 0.07879371 -1.281974 -1.313374
X5 -1.0225015 1.17421841 -1.337733 -1.313374
X6 -0.5543683 1.83147324 -1.170455 -1.053210
X7 -1.4906348 0.73604853 -1.337733 -1.183292

> dim(knnFit1$finalModel$learn$X)
[1] 114 4

Setting the parameter savePredictions = 'final' will make train save only the cross validation predictions from the sub-models trained with the best tuned hyperparameter(s):

> fitControl <- trainControl(## 10-fold CV
+ method = "cv",
+ number = 10,
+ classProbs = TRUE,
+ savePredictions = 'final')

> knnFit1 <- train(Species ~ ., data = training,
+ method = "knn",
+ trControl = fitControl,
+ tuneLength = 10,
+ preProcess = c('center','scale'))

> head(knnFit1$pred)
k pred obs setosa versicolor virginica rowIndex Resample
1 15 setosa setosa 1.00000000 0.00000000 0.00000000 9 Fold01
2 15 versicolor versicolor 0.06666667 0.86666667 0.06666667 45 Fold02
3 15 versicolor versicolor 0.00000000 0.93333333 0.06666667 46 Fold02
4 15 setosa setosa 1.00000000 0.00000000 0.00000000 1 Fold01
5 15 virginica virginica 0.00000000 0.06666667 0.93333333 93 Fold04
6 15 virginica virginica 0.00000000 0.33333333 0.66666667 96 Fold04

> dim(knnFit1$pred)
[1] 114 8

How to use preprocessing with cross validation?

Data leakage

The data leakage problem is a significant issue associated with cross validation. It occurs when information from the test data seeps into the training procedure. For instance, normalizing or standardizing the complete dataset prior to learning would not give a proper test, because the scale of the data in the test set would have affected the training set. One can use the Pipeline class in scikit-learn (Python) to solve this issue. The caret package in R is also capable of handling it: its author states that caret will apply preprocessing only within each resampled version of the data, to avoid data leakage.

As The Elements of Statistical Learning, chapter 7 (Model Assessment and Selection), states:

In cross validation, samples must be “left out” before any selection or filtering steps are applied. But there is one qualification: initial unsupervised screening steps can be done before samples are left out. For example, we could select the 1000 predictors with highest variance across all 50 samples, before starting cross validation. Since this filtering does not involve the class labels, it does not give the predictors an unfair advantage.

Subsampling

Usually, optimistic estimates of performance are more likely if subsampling happens before cross validation. Some discussions can be found in the following URLs:

Complication

One complication is still related to preprocessing. Should the subsampling occur before or after the preprocessing? For example, if you down-sample the data and use PCA for signal extraction, should the loadings be estimated from the entire training set? The estimate is potentially better since the entire training set is being used, but the subsample may happen to capture only a small portion of the PCA space. There isn’t any obvious answer.

Subsampling during resampling


Reference

  • A good brain diet

    • avocado
    • blueberries
    • broccoli
    • eggs
    • coconut/olive oil
    • green leafy dark vegetables
    • wild salmon
    • turmeric
  • Killing ANTs (Automatic Negative Thoughts)

  • Exercise

  • Brain nutrients/Brain Vitamins

  • Positive peer group

  • Clean environment

  • Sleep

  • Brain protection (such as electromagnetic fields)

  • New learnings

  • Stress management