0%

The Laplace technique can be used to approximate the reasonably well behaved functions that have most of their mass concentrated in a small area of their domain.(Laplace approximation) Technically, it works for functions that are in the class of \(\mathcal{L}^2\), which is also called the square integrable function, meaning \[ \int g(x)^2dx <\infty \] Such a function generally has a very rapid decreasing tails so that in the far reaches of the domain we would not expect to see large spikes.

The Laplace approximation framework is a simple but widely used framework, and aims to find a Gaussian approximation to a probability density defined over a set of continuous variables. (Laplace approximation)

In Bayesian statistics, Laplace approximation can refer to either approximating the posterior normalizing constant with Laplace's method or approximating the posterior distribution with a Gaussian centered at the maximum a posteriori estimate.(Amaral Turkman 2019)

Suppose we want to approximate the pdf \(p(\theta)\), which doesn't belong to any known distribution. The density curve is smooth and well peaked around its point of maxima \(\hat{\theta}\). Thereby, \(\frac{dp(\theta)}{d\theta}|_{\hat{\theta}}=0\) and \(\frac{d^2p(\theta)}{d\theta^2}|_{\hat{\theta}}<0\); thus we can conclude that \(\frac{dlnp(\theta)}{d\theta}|_{\hat{\theta}}=0\) and \(\frac{d^2lnp(\theta)}{d\theta^2}|_{\hat{\theta}}<0\)

Then we can approximate it by a normal pdf. Denote \(h(\theta)=lnp(\theta)\) \[ \begin{gather*} h(\theta)\approx lnp(\hat{\theta})+(\theta-\hat{\theta})\frac{dlnp(\theta)}{d\theta}|_{\hat{\theta}}+\frac{1}{2}(\theta-\hat{\theta})^2\frac{d^2lnp(\theta)}{d\theta^2}|_{\hat{\theta}}\\ =lnp(\hat{\theta})-\frac{1}{2}(\theta-\hat{\theta})^2\frac{-d^2lnp(\theta)}{d\theta^2}|_{\hat{\theta}}\ (\frac{dlnp(\theta)}{d\theta}|_{\hat{\theta}}=0)\\ =lnp(\hat{\theta})-\frac{(\theta-\hat{\theta})^2}{2\sigma^2}\ (\sigma^2=(\frac{-d^2lnp(\theta)}{d\theta^2}|_{\hat{\theta}})^{-1}>0)\\ \Rightarrow p(\theta)=e^{h(\theta)}\approx p(\hat{\theta})e^{-\frac{(\theta-\hat{\theta})^2}{2\sigma^2}}\\ \Rightarrow p(\theta)\propto e^{-\frac{(\theta-\hat{\theta})^2}{2\sigma^2}} \end{gather*} \] Suppose \(f(\theta)=e^{-\frac{(\theta-\hat{\theta})^2}{2\sigma^2}}\), then we can approximate \(p(\theta)\) as \[ \begin{gather*} p(\theta)\approx \frac{1}{\int f(\theta)d\theta}f(\theta)\\ =\frac{1}{\sqrt{2\pi}\sigma\int \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(\theta-\hat{\theta})^2}{2\sigma^2}}d\theta}e^{-\frac{(\theta-\hat{\theta})^2}{2\sigma^2}}\\ =\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(\theta-\hat{\theta})^2}{2\sigma^2}}\\ \text{where }\int f(\theta)d\theta\text{ is the normalizing term} \end{gather*} \] As a result, pdf of \(\theta\) is approximated by a normal distribution using Laplace method, which can be shown as below \[ \begin{gather*} \theta\sim N(\hat{\theta},\sigma^2)\\ \sigma^2=(\frac{-d^2lnp(\theta)}{d\theta^2}|_{\hat{\theta}})^{-1}>0 \end{gather*} \]

Reference

The p.d.f. of log-normal distribution is \[ f_X(x)=\frac{1}{x\sigma\sqrt{2\pi}}e^{-\frac{1}{2}(\frac{lnx-\mu}{\sigma})^2} \]

The log-normal distribution is a right skewed continuous probability distribution, meaning it has a long tail towards the right. It is used for modelling various natural phenomena such as income distributions, the length of chess games or the time to repair a maintainable system and more.

If the random variable \(X\) is log-normally distributed, then \(Y=\ln(X)\) has a normal distribution. We can apply the same rule from Density of transformed random variable to prove this statement. \[ \begin{gather*} X\sim LogNormal(\mu,\sigma^2)\\ Y=lnX\\ X=\psi(Y)=e^Y\\ \Rightarrow g_Y(y)=f(\psi(y))|\frac{dx}{dy}|\\ =\frac{1}{e^y\sigma\sqrt{2\pi}}e^{-\frac{1}{2}(\frac{y-\mu}{\sigma})^2}e^y\\ =\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}(\frac{y-\mu}{\sigma})^2}\\ \Rightarrow Y\sim N(\mu,\sigma^2) \end{gather*} \]

Reference

A good interpretation of Beta distribution can be referred to a thread from What is the intuition behind beta distribution

The short version is that the Beta distribution can be understood as representing a distribution of probabilities, that is, it represents all the possible values of a probability when we don't know what that probability is.

For \(\theta\sim Beta(\alpha,\beta)\) \[ \begin{gather*} p(\theta)=\frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha,\beta)}\\ =\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\theta^{\alpha-1}(1-\theta)^{\beta-1}\\ \theta\in[0,1] \end{gather*} \] The expectation and variance of Beta distribution is \[ \begin{gather*} E(\theta)=\frac{\alpha}{\alpha+\beta}\\ var(\theta)=\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)} \end{gather*} \] When \(\alpha=\beta=1\), Beta distribution is equivalent to the uniform distribution \[ \begin{gather*} p(\theta)=\frac{\Gamma(2)}{\Gamma(1)\Gamma(1)}\theta^0(1-\theta)^0\\ =\frac{1\Gamma(1)}{1*1}\\ =1\\ F_{\Theta}(x)=\int_0^x 1d\theta=x\in[0,1]\\ \Rightarrow \theta\sim U(0,1) \end{gather*} \]

Reference

Q value

From wikipedia, q value is phrased as follows

In statistical hypothesis testing, specifically multiple hypothesis testing, the q-value provides a means to control the positive false discovery rate (pFDR).John D. 2003 Just as the p-value gives the expected false positive rate obtained by rejecting the null hypothesis for any result with an equal or smaller p-value, the q-value gives the expected pFDR obtained by rejecting the null hypothesis for any result with an equal or smaller q-value.

From Statistics and Data analysis: From elementary to Intermediate, we can find the explicit definition of p-value

Simply rejecting or not rejecting \(H_0\) at a specified \(\alpha\) does not fully convey the information in the data. It is more useful to report the smallest a-level at which the observed test result is significant. This smallest \(\alpha\)-level is called the observed level of significance or the P-value. The smaller the P-value, the more significant is the test result. Once the P-value is computed, a test at any specified \(\alpha\) can be conducted by rejecting \(H_0\) if P-value\(<\alpha\). An alternative definition of the P-value is that it is the probability under Null hypothesis of obtaining a test statistic at least as "extreme" as the observed value.

We can provide an example to enhance the understanding of p-value and q-value. Assume we have \(1000\) tests. Each test will produce a p-value. We can order these p-values from the smallest to the largest. A p-value of \(0.05\) means \(5\%\) of all tests which is 50 tests are expected to output false positive results. A q-value of \(0.05\) means tests with q-value less than \(0.05\) are all considered to be significant, and \(5\%\) of these significant tests are expected to output false positive results.

One of the methods that adjust p-value to q-value is the Benjamini-Hochberg method, (short for BH). The BH-adjusted p-values are defined as \[ p_{(i)}^{BH}=min\{\underset{j\geq i}{min}\{\frac{mp_{(j)}}{j}\},1\} \] The process is as follows.

  1. Order all p-values in an ascending sequence. Then multiply each p-value by the total number of test \(m\) and divide by its rank order
  2. For the first p-value, find the minimum value of \(\frac{mp_{(j)}}{j}\)for \(j=1,2,\cdots,m\); compare with \(1\), assign q-value as \(1\) if it is larger than \(1\). This value would be the smallest q-value among all tests; For the second p-value, find the minimum value of \(\frac{mp_{(j)}}{j}\)for \(j=2,\cdots,m\), this q-value must be less than the first q-value. We begin this procedure until all p-values are adjusted to q-values

We can write BH method using R code

1
2
3
4
5
6
7
8
9
10
11
12
13
> BH <- function(p.values)
+ {
+ p <- p.values
+ lp <- length(p)
+ i <- lp:1L
+ # key to the p.values (key1)
+ o <- order(p,decreasing = TRUE)
+ # key to the key1 (key2)
+ ro <- order(o)
+ pmin(1,cummin(lp/i * p[o]))[ro]
+ }
> BH(c(0.05,0.02,0.2,0.4,0.5))
[1] 0.1250000 0.1000000 0.3333333 0.5000000 0.5000000

We can visualize this process

p.value 0.05 0.02 0.2 0.4 0.5
p.value position (key1) 1 2 3 4 5

\[ \downarrow \]

p.value position(key1) 5 4 3 1 2
q.value 0.5 0.5 1/3 0.125 0.1
Position of key (key2) 1 2 3 4 5

\[ \downarrow \]

position of key (key2) 4 5 3 2 1
p.value position (key1) 1 2 3 4 5
p.value 0.05 0.02 0.2 0.4 0.5
q.value 0.125 0.1 1/3 1/2 1/2

Reference

Suppose we are given the distribution of a continuous random variable \(X\). Let \(Y=\phi(X)\), where \(\phi\) is an invertible (one-to-one) function. Let \(\psi=\phi^{-1}\) so that \(X=\psi(Y)\). Furthermore,suppose \(\psi\) is differentiable. Then the p.d.f. of \(Y\) is \[ \begin{gather} g(y)=f(\psi(y))|\frac{dx}{dy}|\\ \text{where }x=\psi(y) \end{gather} \] From the equation, we can easily observe that \(|\frac{dx}{dy}|\) is the dispersion of \(Y\) when random variable \(X\) is transformed to random variable \(Y\). Thus one cannot simply replace \(X\) with \(\psi(y)\) to derive the density of \(Y\) without considering the difference of dispersion between \(X\) and \(Y\).

Proof:

Let \(G(y)\) represents the c.d.f. of \(Y\). first suppose \(\phi\) is a strictly increasing function. Then \(\psi=\phi^{-1}\) is also a strictly increasing function and \(d\psi(y)/dy>0\). Therefore \[ \begin{gather}\\ G(y)=P[Y\leq y]=P[\phi(X)\leq y]\\ =P(X\leq \phi^{-1}(y))=F(\psi(y)) \end{gather} \] Differentiating \(G(y)\) by using the chain rule, it follows that \[ \begin{gather} g(y)=\frac{dG(y)}{dy}=\frac{dF(\psi(y))}{dy}\\ =f(\psi(y))\frac{d\psi(y)}{dy}=f(\psi(y))\frac{dx}{dy} \end{gather} \] Next suppose \(\phi\) is a strictly decreasing function. Then \(\psi=\phi^{-1}\) is also a strictly decreasing function and \(d\psi(y)/dy<0\). Therefore \[ G(y)=P[Y\leq y] = P[\phi(X)\leq y]=P[X\geq\phi^{-1}(y)]=1-F(\psi(y)) \] Differentiating \(G(y)\) by using the chain rule, it follows that \[ \begin{gather} g(y)=\frac{dG(y)}{dy} = -\frac{dF(\psi(y))}{dy}\\ =-f(\psi(y))\frac{d\psi(y)}{dy}\\ =f(\psi(y))|\frac{dx}{dy}| \end{gather} \]

Example

Suppose \(Y=lnX\), where \(X\sim Beta(a,b)\).

  • Then we have p.d.f. of \(X\)

\[ \begin{gather} f_X(x)=\frac{x^{a-1}(1-x)^{b-1}}{B(a,b)}\\ \text{where }B(a,b)=\frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}\\ x\in[0,1] \end{gather} \]

\[ \begin{gather} Y=\phi(X)=lnX\\ Y\in(-\infty,0]\\ X=\psi(Y)=e^Y\\ X\in[0,1] \end{gather} \]

  • Then we have p.d.f. of \(Y\)

\[ \begin{gather} g_Y(y)=f_X(\psi(y))|\frac{dx}{dy}|\\ =\frac{e^{y(a-1)}(1-e^y)^{b-1}}{B(a,b)}e^y\ (\frac{dx}{dy}=\frac{de^y}{dy}=e^y)\\ =\frac{e^{ay}(1-e^y)^{b-1}}{B(a,b)} \end{gather} \]

  • Next we derive the m.g.f. (moment generating function) of \(Y\)

\[ \begin{gather} M_Y(t)=E[e^{tY}]\\ =\int_{-\infty}^0e^{ty}g_Y(y)dy\\ =\int_{-\infty}^0\frac{e^{ay}(1-e^y)^{b-1}}{B(a,b)}e^{ty}dy\\ =\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\int_{-\infty}^0(e^y)^{a+t}(1-e^y)^{b-1}dy\\ =\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\int_0^1x^{a+t}(1-x)^{b-1}dlnx\ (e^y=e^{lnx}=x)\\ =\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\int_0^1x^{a+t}(1-x)^{b-1}\frac{1}{x}dx\\(\frac{dy}{dx}=\frac{dlnx}{dx}=\frac{1}{x}\Rightarrow dlnx=\frac{1}{x}dx)\\ =\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\int_0^1x^{a+t-1}(1-x)^{b-1}dx\\ =\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}B(a+t,b)\int_0^1\frac{x^{a+t-1}(1-x)^{b-1}}{B(a+t,b)}dx\\ =\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}B(a+t,b)\\ (\int_0^1\frac{x^{a+t-1}(1-x)^{b-1}}{B(a+t,b)}dx\text{ is the integration of distribution }Beta(a+t,b)\text{ which is 1})\\ =\frac{\Gamma(a+b)\Gamma(a+t)\Gamma(b)}{\Gamma(a)\Gamma(b)\Gamma(a+b+t)}\\ =\frac{\Gamma(a+b)\Gamma(a+t)}{\Gamma(a+b+t)\Gamma(a)} \end{gather} \]

Types of Statistical studies

  • Study with different purposes

    • Comparative (analytical) study

      A study whose purpose is to compare two or more alternative methods or groups distinguished by some attribute is called a comparative (analytical) study

    • Noncomparative (descriptive) study

      A study whose purpose is to learn about the characteristics of a group, but no necessarily to make comparasions, is called a noncomparative (descriptive) study

  • Study with or without intervention

    • Observational study

      The researcher is a passive observer who documents the process

    • Experimental study

      The researcher actively intervenes to control the study conditions and records the response

    Remark: Both study types can be either comparative or noncomparative

Confounding variable

If the effects of predictor variables cannot be separated from the effects of the uncontrolled variables, those uncontrolled variables are called confounding variables. A variable must meet two conditions to be a confounder:

  • It must be correlatedd with the independent variable. This may be a causal relationship, but it does not have to be.
  • It must be causally related to the dependent variable

Sometimes the presence of these variables is not recognized after the fact, in which case they are referred to as lurking variables

graph TD
A[Confounding variable]-.->B[predictor variable];
B -->C[response variable];
A -->C;

See more info

Comparative study

The goal of the simplest and most common comparative studies is to evaluate how a "change" from a "baseline" or "normal" condition affects a response variable.

  • The changed condition is called a treatment or an intervention and the normal condition is called a control

  • The treatment and control group should be similar in all other respects excepty for the changed condition to provide a valid comparison

Observational study

  • A sample survey provides a snapshot of the population based on a sample observed at a point in time
  • A prospective study follows a sample forward in time
  • A retrospective study follows a sample backward in time

Sample survey

  • A numerical characteristic of a population defined for a specific variable is called a parameter
  • A numerical function of the sample data, called a statistic, is used to estimate the unknown population parameter
  • It is possible to list all units in a finite population; a list of all units in a finite population is called a sampling frame;
  • It is impossible to list all units of an infinite population as there is no upper limit on the number of units in a population
  • The population of interest is called the target population; sometimes the target population is difficult to study, so sampling is done using another easily accessible population, called the sampled population
  • The deviation between a sample estimate and the true parameter value is called the sampling error; nonsampling error often cause bias, which is a systematic deviation between the sample estimate and the population parameter; bias is a more serious problem than sampling errors because it doesn't disappear by increasing the sample size. For example, measurement bias affects all measurements. Nonsampling error includes other types such as self-selection bias and response bias

Basic sampling designs

  • Simple random sample(SRS)

    Draw a sample of size \(n\) from a population of size \(N\) Without replacement

  • Stratified Random sampling

    Divide a population into homogeneous subpopulations and draw SRS from each one

  • Multistage cluster sampling

    Draw SRS layer by layer from the top to the bottom

  • Systematic sampling

    A 1-in-k systematic sampling consists of selecting one unit at random from the first \(k\) Unit and selecting every \(kth\) Unit thereafter; mainly useful for cases such as assembly line or cars entering a toll booth on a highway

Experimental studies

  • Treatment factors are controlled in an experiment, effects of the response variable are of primary interest
  • Nuisance factors, or noise factors are all the other factors that might affect the response variable
  • The different possible values of a factor are called levels, and each treatment is a particular combination of the levels of different treatment factors
  • Subjects or items receiving treatments are called experimental units; all experimental units receiving the same treatment form a treatment group
  • A run is an observation made on an experimental unit after a particular treatment condition; a replicate is another independent run carried out under a particular identical conditions; repeat measurements of the same response is not replicate

Strategies to reduce experimental error variation

Main components of experimental error:

  • Systematic error

    Caused by difference between experimental units; the nuisance factors on which the experimental units differ are said to confound or bias the treatment comparisons

  • Random error

    Caused by the inherent variability in the response of similar experimental units given the same treatment

  • Measurement error

    Caused by imprecise measurement instruments

  • Strategies to reduce the systematic error with respect to type of nuisance factor

    • Blocking: nuisance factor is controllable
    • Regression Analysis: nuisance factors cannot be used as blocking factor because they are not controllable; such nuisance factors are called covariates. A covariate should be measured before the treatment are applied to experimental units because it could also be affected by the treatment
    • Randomization: in terms of additional known or unknown nuisance factors, the experimental units may well differ on these factors, thus biasing the results. Randomization doesn't imply experimental units are equal for each treatment, but that no one treatment is favored

    Note: A covariates can be an independent variable (i.e. of direct interest) or it can be unwanted, confounding variable. Adding a covariate to a model can increase the accuracy of the results See more info

    In summary: Block over those nuisance factors that can be easily controlled, randomize over the rest

  • Strategy to reduce the random error

    Replicating the experimental runs; making multiple independent runs under identical treatment conditions

  • Strategy to reduce the measurement error

    Repeat measurements, including having different subjects make the measurements;

    Variation within each group of measurements can be used to estimate random and measurement errors.

Basic experimental designs

  • Completely randomized design (CRD)

    All experimental units are assigned at random to the treatment

    Batch 1 Batch 2 Batch 3 Batch 4 Batch 5
    A C D D B
    B A A C D
    C D B A C
    B C D B A

    Randomly assign \(20\) samples to the \(4\) treatments \(ABCD\) without considering the batchs they are from

  • Randomized block design (RBD)

    Forming blocks of units which are similar (in terms of chosen blocking factors)

    Two special cases:

    • Matched pairs design: Match subjects on the nuisance factors (controllable)
    • Cross-over design: the order in which the methods are administered may introduce bias due to the learning effect
    Batch 1 Batch 2 Batch 3 Batch 4 Batch 5
    A D C D B
    C A B C C
    B C D B A
    D B A A D

    Randomly assign \(4\) treatments to the experiments within each block

The coefficient of determination of \(R^2\)

To begin with, we need to mention several terms that we must define

  • SST: sum of squares total, as the squared difference between the observed dependent variable and its mean \[ \sum(y_i-\overline{y})^2 \]

  • SSR/ESS: sum of squares due to regression or explained sum of squares, as the sum of the differences between the predicted value and the mean of the dependent variable \[ \sum(\hat{y}_i-\overline{y})^2=\sum(\hat{y}_i-\overline{\hat{y}})^2 \]

  • SSE/RSS: sum of square error or residual sum of squares, as the difference between the observed value and the predicted value, and as the unexplained variability \[ \sum(y_i-\hat{y}_i)^2=\sum e_i^2 \]

The coefficient of determination of \(R^2\) shows how much of the variation of the dependent variable Var(y) can be explained by the model; In OLS estimate, \(R^2\) can also be calculated as the ratio of Explained sum of square (ESS) to total sum squared(TSS) \[ R^2=\frac{ESS}{SST}=1-\frac{SSE}{SST}=1-\frac{\sum e_i^2}{\sum(y_i-\overline{y})^2} \]

Formulate \(R^2\)

Suppose \(y_i=\hat{y}_i+e_i\) \[ \begin{gather} Var(y)=Var(y)+Var(e)+2Cov(\hat{y},e)\\ =Var(\hat{y})+Var(e)\ (Cov(\hat{y},e)=0)\\ \Rightarrow\sum(y_i-\overline{y})^2=\sum(\hat{y}_i-\overline{\hat{y}})^2+\sum(e_i-\overline{e})^2\\ =\sum(\hat{y}_i-\overline{\hat{y}})^2+\sum e_i^2\ (\overline{e}=0)\\ =\sum(\hat{y}_i-\overline{y})^2+\sum e_i^2 \end{gather} \]

Relationship between \(R^2\) and Pearson correlation coefficient

\[ \begin{gather} r^2_{y,\hat{y}}=(\frac{Cov(y,\hat{y})}{\sqrt{Var(y)Var(\hat{y})}})^2 =\frac{Cov(y,\hat{y})Cov(y,\hat{y})}{Var(y)Var(\hat{y})}\\ =\frac{Cov(\hat{y}+e,\hat{y})^2}{Var(y)Var(\hat{y})} =\frac{Cov(\hat{y},\hat{y})^2}{Var(y)Var(\hat{y})}\ (Cov(\hat{y},e)=0)\\ =\frac{Var(\hat{y})}{Var(y)} =\frac{\sum(\hat{y}_i-\overline{\hat{y}})^2}{\sum(y_i-\overline{y})^2}\\ =\frac{SSR}{SST}\\ =R^2 \end{gather} \]

Conclusion: \(R^2\) is equal to the squared Pearson correlation coefficient between observed response variable \(y\) and predicted response variable \(\hat{y}\)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
> iris_model <- lm(Petal.Width~Sepal.Length+Sepal.Width+Petal.Length,data = iris)
> summary(iris_model)

Call:
lm(formula = Petal.Width ~ Sepal.Length + Sepal.Width + Petal.Length,
data = iris)

Residuals:
Min 1Q Median 3Q Max
-0.60959 -0.10134 -0.01089 0.09825 0.60685

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.24031 0.17837 -1.347 0.18
Sepal.Length -0.20727 0.04751 -4.363 2.41e-05 ***
Sepal.Width 0.22283 0.04894 4.553 1.10e-05 ***
Petal.Length 0.52408 0.02449 21.399 < 2e-16 ***
---
Signif. codes: 0***0.001**0.01*0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.192 on 146 degrees of freedom
Multiple R-squared: 0.9379, Adjusted R-squared: 0.9366
F-statistic: 734.4 on 3 and 146 DF, p-value: < 2.2e-16

> (cor(iris_model$fitted.values,iris$Petal.Width))^2
[1] 0.9378503

Particularly, for simple linear regression, the coefficient of determination also equals the square of the sample correlation coefficient \(r_{xy}\)

Assume \(y_i=\beta_0+\beta_1x_i\), then the best linear unbiased estimates of \(\beta_0\) and \(\beta_1\) are \[ \begin{gather*} \hat{\beta_0}=\bar{y}-\hat{\beta_1}\bar{x}\\ \hat{\beta_1}=\frac{S_{xy}}{S_{xx}} \end{gather*} \] where \[ \begin{gather*} S_{xy}=\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})\\ S_{xx}=\sum_{i=1}^n(x_i-\bar{x})^2\\ S_{yy}=\sum_{i=1}^n(y_i-\bar{y})^2 \end{gather*} \] Then we could easily obtain the relationship between \(R^2\) and the \(r_{xy}\) as \[ \begin{gather*} R^2 = \frac{ESS}{SST}=\frac{\hat{\beta_1}^2S_{xx}}{S_{yy}}\\ =(\frac{S_{xy}}{S_xS_y})^2\\ =r^2_{xy} \end{gather*} \] We could also test this by running R code

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
> model <- lm(Petal.Width ~ Sepal.Length,data = iris)
> summary(model)

Call:
lm(formula = Petal.Width ~ Sepal.Length, data = iris)

Residuals:
Min 1Q Median 3Q Max
-0.96671 -0.35936 -0.01787 0.28388 1.23329

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.20022 0.25689 -12.46 <2e-16 ***
Sepal.Length 0.75292 0.04353 17.30 <2e-16 ***
---
Signif. codes: 0***0.001**0.01*0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.44 on 148 degrees of freedom
Multiple R-squared: 0.669, Adjusted R-squared: 0.6668
F-statistic: 299.2 on 1 and 148 DF, p-value: < 2.2e-16

> cor(model$fitted.values,iris$Petal.Width)^2
[1] 0.6690277
> cor(iris$Petal.Width,iris$Sepal.Length)^2
[1] 0.6690277

Reference

Prove mean of predicted values in OLS regression is equal to the mean of original values

In OLS estimation, we can summarize response \(y\) as \[ y_i=\hat{y}_i+e_i \] where residual \(e_i\) is assumed to follow a normal distribution \(N(0,\sigma^2)\) \[ \sum e_i=\overline{e}=0 \] thereby, we have \[ \sum_{i=1}^ny_i=\sum_{i=1}^n(\hat{y}_i+e_i)\\ =\sum_{i=1}^n\hat{y}_i \]

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
> iris_model <- lm(Petal.Width~Sepal.Length+Sepal.Width+Petal.Length,data = iris)
> summary(iris_model)

Call:
lm(formula = Petal.Width ~ Sepal.Length + Sepal.Width + Petal.Length,
data = iris)

Residuals:
Min 1Q Median 3Q Max
-0.60959 -0.10134 -0.01089 0.09825 0.60685

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.24031 0.17837 -1.347 0.18
Sepal.Length -0.20727 0.04751 -4.363 2.41e-05 ***
Sepal.Width 0.22283 0.04894 4.553 1.10e-05 ***
Petal.Length 0.52408 0.02449 21.399 < 2e-16 ***
---
Signif. codes: 0***0.001**0.01*0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.192 on 146 degrees of freedom
Multiple R-squared: 0.9379, Adjusted R-squared: 0.9366
F-statistic: 734.4 on 3 and 146 DF, p-value: < 2.2e-16

> mean(iris_model$fitted.values)
[1] 1.199333
> mean(iris$Petal.Width)
[1] 1.199333

Definition

The model is called “linear” not because it is linear in \(x\), but rather because it is linear in the parameters.

The following are the examples of linear models:

\[ \begin{gather*} y =\beta_0+\beta_1x_1+\beta_2x_2+\epsilon\\ y =\beta_0+\beta_1x_1+\beta_2x_2^2+\epsilon\\ y =\beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_1^2+\beta_4x_2^2+\beta_5x_1x_2+\epsilon \end{gather*} \]

A model that can be transformed so that it becomes linear in its unknown parameters is called intrinsically linear; Otherwise it is called a nonlinear model

The following are nonlinear models

\[ \begin{gather*} y =\beta_0+\beta_1x_1+\beta_2x_2^{\beta_3}+\epsilon\\ y = \alpha_0z_1^{\alpha_1}z_2^{\alpha_2}\eta \end{gather*} \]

The following are some nonlinear models that can be made linear by transformation

\[ \begin{gather*} log \ y=log\alpha_0+\alpha_1logz_1+\alpha_2logz_2+log\eta\\ \Rightarrow y*=\beta_0+\beta_1x_1+\beta_2x_2+\epsilon \end{gather*} \]

There are four key assumptions for Simple Linear regression

  • Linear relationship between dependent variable and independent variable

  • The variance of the residuals is constant (constant variance errors, homoscedasticity)

  • Independence of observation (no autocorrelation). Simply put, the model assumes that the values of residuals are independent

    sample error is dependent,

  • Normally distributed errors

In addition to the four assumptions listed above, there is one more for Multiple Linear regression.

  • The data should not show multicollinearity.

Useful blogs

There is an interesting post about Multicollinearity that I'd like to mention here.

Is it necessary to correct collinearity when square terms are in a model?

  • Question: I had a regression model where one of the explanatory variable is "age". I added a "age-squared" variable since the distribution of age was in a quadratic form. It is obvious that the 'age' and 'age-squared' variables will be highly correlated. In that case, is it really necessary to deal the collinearity problem in the model?

Several answers are great:

  • Multicollinearity is NOT a problem in your case. Multicollinearity has to be checked and problems have to be solved when you want to estimate the independent effect of two variables which happen to be correlated by chance. This is NOT your problem with your age and age-squared variables since you should never be interested in evaluating the effect of changing age without changing agesquared. So do not care about multicollinearity between one variable and a second variable which is a deterministic non linear function of the first one. Except in the case of perfect multocillinearity, which would be the case if you had only two different values for your age variable.
  • Age and age squared are correlated. However, one is not a linear transformation of the other by definition.
  • Another possibility is to run a regression between Age and Age2 to see that the R2 is close to zero. This indicates, as mentioned before by other colleagues, that collinearity is very low and can be overlooked from a econometric perspective.

In linear regression, when is it appropriate to use the log of an independent variable instead of the actual values?

https://stats.stackexchange.com/questions/310003/why-in-box-cox-method-we-try-to-make-x-and-y-normally-distributed-but-thats-no?noredirect=1&lq=1

Reference

Principal Component analysis

PCA deduction using Spectral decomposition

Suppose we have a random vector \[ \begin{gather} X = \begin{pmatrix}X_1\\X_2\\ \vdots\\\ X_p\end{pmatrix}\\ var(X) = \Sigma=\begin{bmatrix}\sigma_1^2 & \sigma_{21}^2& \cdots & \sigma_{1p}^2\\ \sigma_{21}^2& \sigma_{22}^2 & \cdots & \sigma_{2p}^2\\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{p1}^2&\sigma_{p2}^2& \cdots & \sigma_{p}^2 \end{bmatrix}\\ \end{gather} \] We need to augment the spread between the data, so we need to consider linear combination \[ Y_i = e_{11}X_1+e_{12}X_2+\cdots+e_{1p}X_p=\mathbf{e}_i^TX \] and maximize the variance of linear combination \[ \begin{gather} var(Y_i)=E[(\mathbf{e}_i^TX-E(\mathbf{e}_i^TX))^2]\\ E[(\mathbf{e}_i^TX-E(\mathbf{e}_i^TX))^T(\mathbf{e}_i^TX-E(\mathbf{e}_i^TX))]\\ =E[(\mathbf{e}_i^TX-E(\mathbf{e}_i^TX))(\mathbf{e}_i^TX-E(\mathbf{e}_i^TX))^T](\text{variance is a scalar})\\ =E[\mathbf{e}_i^TXX^T\mathbf{e}_i-\mathbf{e}_i^TEXEX^T\mathbf{e}_i]\\ =E[\mathbf{e}_i^T(XX^T)\mathbf{e}_i]\\ =\mathbf{e}_i^T[E(XX^T)-E(X)E(X^T)]\mathbf{e}_i\\ =\mathbf{e}_i^T\Sigma \mathbf{e}_i \end{gather} \] Since \(\Sigma \mathbf{e}_i=\lambda_i\mathbf{e}_i\) in spectral decomposition \[ var(Y_i)=e_i^T\lambda_ie_i=\lambda_ie_i^Te_i=\lambda_i \]