Assumptions of linear regression

Definition

The model is called “linear” not because it is linear in \(x\), but rather because it is linear in the parameters.

The following are examples of linear models:

\[ \begin{gather*} y =\beta_0+\beta_1x_1+\beta_2x_2+\epsilon\\ y =\beta_0+\beta_1x_1+\beta_2x_2^2+\epsilon\\ y =\beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_1^2+\beta_4x_2^2+\beta_5x_1x_2+\epsilon \end{gather*} \]
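Because all three models above are linear in the \(\beta\)s, ordinary least squares applies directly: the nonlinear terms in \(x\) just become extra columns of the design matrix. A minimal sketch with NumPy and synthetic data (the true coefficients here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 10, n)

# True model: y = 1 + 2*x1 + 0.5*x2^2 + eps -- linear in the parameters,
# even though it is quadratic in x2.
y = 1 + 2 * x1 + 0.5 * x2**2 + rng.normal(0, 1, n)

# Put the transformed regressor x2^2 in the design matrix and run OLS.
X = np.column_stack([np.ones(n), x1, x2**2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # close to [1, 2, 0.5]
```

The same trick covers the interaction term \(x_1x_2\): it is just one more column.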

A model that can be transformed so that it becomes linear in its unknown parameters is called intrinsically linear; otherwise it is called a nonlinear model.

The following are nonlinear models:

\[ \begin{gather*} y =\beta_0+\beta_1x_1+\beta_2x_2^{\beta_3}+\epsilon\\ y = \alpha_0z_1^{\alpha_1}z_2^{\alpha_2}\eta \end{gather*} \]

The second model above is intrinsically linear: taking logs makes it linear in the parameters:

\[ \begin{gather*} \log y=\log\alpha_0+\alpha_1\log z_1+\alpha_2\log z_2+\log\eta\\ \Rightarrow y^*=\beta_0+\beta_1x_1+\beta_2x_2+\epsilon \end{gather*} \]
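As a sketch of the transformation above (synthetic data; the parameter values are made up), fit the multiplicative model by running OLS on the logs:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
z1 = rng.uniform(1, 10, n)
z2 = rng.uniform(1, 10, n)

# Multiplicative model: y = a0 * z1^a1 * z2^a2 * eta, with log-normal eta
a0, a1, a2 = 2.0, 1.5, -0.7
eta = np.exp(rng.normal(0, 0.1, n))
y = a0 * z1**a1 * z2**a2 * eta

# After taking logs the model is linear in the parameters:
# log y = log a0 + a1*log z1 + a2*log z2 + log eta
X = np.column_stack([np.ones(n), np.log(z1), np.log(z2)])
beta, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)
print(np.exp(beta[0]), beta[1], beta[2])  # close to 2.0, 1.5, -0.7
```

Note that the intercept estimates \(\log\alpha_0\), so \(\alpha_0\) is recovered by exponentiating.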

There are four key assumptions for simple linear regression:

  • Linear relationship between the dependent variable and the independent variable

  • The variance of the residuals is constant (constant variance errors, homoscedasticity)

  • Independence of observations (no autocorrelation). Simply put, the model assumes that the residuals are independent of one another; in time-series data, for example, successive errors are often dependent

  • Normally distributed errors

In addition to the four assumptions listed above, there is one more for multiple linear regression.

  • The data should not show multicollinearity.
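One common way to screen for multicollinearity is the variance inflation factor (VIF). A minimal sketch using only NumPy (the `vif` helper and the threshold mentioned are illustrative; libraries such as statsmodels provide their own implementation):

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (no intercept column).

    VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing column j
    on the remaining columns. A common rule of thumb flags VIF > 5 or 10.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = []
    for j in range(p):
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
        resid = X[:, j] - A @ coef
        r2 = 1 - resid.var() / X[:, j].var()
        out.append(1 / (1 - r2))
    return np.array(out)

rng = np.random.default_rng(2)
x1 = rng.normal(size=300)
x2 = x1 + rng.normal(scale=0.1, size=300)  # nearly collinear with x1
x3 = rng.normal(size=300)                  # independent of the others
v = vif(np.column_stack([x1, x2, x3]))
print(v)  # first two entries large, third close to 1
```

Columns that are near-linear combinations of the others get large VIFs; an independent column stays near 1.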

Useful blogs

There is an interesting post about Multicollinearity that I'd like to mention here.

Is it necessary to correct collinearity when square terms are in a model?

  • Question: I had a regression model where one of the explanatory variables is "age". I added an "age-squared" variable since the distribution of age was in a quadratic form. It is obvious that the 'age' and 'age-squared' variables will be highly correlated. In that case, is it really necessary to deal with the collinearity problem in the model?

Several answers are great:

  • Multicollinearity is NOT a problem in your case. Multicollinearity has to be checked and problems have to be solved when you want to estimate the independent effect of two variables which happen to be correlated by chance. This is NOT your problem with your age and age-squared variables, since you should never be interested in evaluating the effect of changing age without changing age-squared. So do not worry about multicollinearity between one variable and a second variable which is a deterministic nonlinear function of the first one. The exception is perfect multicollinearity, which would be the case if you had only two different values for your age variable.
  • Age and age squared are correlated. However, one is not a linear transformation of the other by definition.
  • Another possibility is to run a regression between Age and Age^2 to see that the R^2 is close to zero. This indicates, as mentioned before by other colleagues, that collinearity is very low and can be overlooked from an econometric perspective.
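A quick sketch of the point behind these answers (synthetic ages; note that for raw positive ages the correlation is actually very high, and it is centering age before squaring that drives the linear correlation near zero when the age distribution is roughly symmetric):

```python
import numpy as np

rng = np.random.default_rng(3)
age = rng.uniform(20, 70, 1000)

# Raw age vs age squared: almost perfectly correlated
r_raw = np.corrcoef(age, age**2)[0, 1]
print(r_raw)  # very close to 1

# Centering age before squaring removes most of the linear correlation,
# since (age - mean)^2 is an even function of the centered age
c = age - age.mean()
r_centered = np.corrcoef(c, c**2)[0, 1]
print(r_centered)  # close to 0
```

Either way, as the first answer notes, the high raw correlation is deterministic and does not invalidate the model; centering is mainly a numerical and interpretive convenience.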

In linear regression, when is it appropriate to use the log of an independent variable instead of the actual values?

https://stats.stackexchange.com/questions/310003/why-in-box-cox-method-we-try-to-make-x-and-y-normally-distributed-but-thats-no?noredirect=1&lq=1
