Relationship between R-squared and the correlation coefficient
The coefficient of determination \(R^2\)
To begin with, we need to define several terms:
SST: total sum of squares, the sum of squared differences between the observed dependent variable and its mean \[ \sum(y_i-\overline{y})^2 \]
SSR/ESS: sum of squares due to regression, or explained sum of squares, the sum of squared differences between the predicted values and the mean of the dependent variable \[ \sum(\hat{y}_i-\overline{y})^2=\sum(\hat{y}_i-\overline{\hat{y}})^2 \]
SSE/RSS: sum of squared errors, or residual sum of squares, the sum of squared differences between the observed and predicted values; this is the unexplained variability \[ \sum(y_i-\hat{y}_i)^2=\sum e_i^2 \]
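These three quantities are straightforward to compute from a fitted model. Below is a minimal R sketch; the model formula on the built-in iris data is simply the illustrative fit used later in this post.

```r
# Illustrative fit on the built-in iris data (the same model fitted later in this post)
fit <- lm(Petal.Width ~ Sepal.Length + Sepal.Width + Petal.Length, data = iris)

y     <- iris$Petal.Width  # observed dependent variable
y_hat <- fitted(fit)       # predicted values
e     <- residuals(fit)    # residuals

SST <- sum((y - mean(y))^2)      # total sum of squares
SSR <- sum((y_hat - mean(y))^2)  # explained sum of squares (ESS)
SSE <- sum(e^2)                  # residual sum of squares
```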
The coefficient of determination \(R^2\) shows how much of the variation of the dependent variable, \(Var(y)\), can be explained by the model. For an OLS fit, \(R^2\) can also be calculated as the ratio of the explained sum of squares (ESS) to the total sum of squares (SST): \[
R^2=\frac{ESS}{SST}=1-\frac{SSE}{SST}=1-\frac{\sum e_i^2}{\sum(y_i-\overline{y})^2}
\]
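Continuing the sketch above, this formula can be checked directly against the \(R^2\) that `lm()` reports (numerical output omitted):

```r
1 - SSE / SST           # R^2 computed from the formula above
summary(fit)$r.squared  # R^2 reported by lm(); the two values agree
```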
Formulating \(R^2\)
Suppose \(y_i=\hat{y}_i+e_i\). Then \[ \begin{gather} Var(y)=Var(\hat{y})+Var(e)+2Cov(\hat{y},e)\\ =Var(\hat{y})+Var(e)\quad (Cov(\hat{y},e)=0)\\ \Rightarrow\sum(y_i-\overline{y})^2=\sum(\hat{y}_i-\overline{\hat{y}})^2+\sum(e_i-\overline{e})^2\\ =\sum(\hat{y}_i-\overline{\hat{y}})^2+\sum e_i^2\quad (\overline{e}=0)\\ =\sum(\hat{y}_i-\overline{y})^2+\sum e_i^2\quad (\overline{\hat{y}}=\overline{y}\text{ since }\overline{e}=0) \end{gather} \]
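The two facts used above, \(\overline{e}=0\) and \(Cov(\hat{y},e)=0\), hold for any OLS fit that includes an intercept, and the resulting decomposition \(SST=SSR+SSE\) can be verified numerically, again continuing the sketch above:

```r
mean(e)        # ~0: residuals of an OLS fit with an intercept average to zero
cov(y_hat, e)  # ~0: fitted values are uncorrelated with the residuals

all.equal(SST, SSR + SSE)  # TRUE: the variance decomposition holds
```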
Relationship between \(R^2\) and the Pearson correlation coefficient
\[ \begin{gather} r^2_{y,\hat{y}}=\left(\frac{Cov(y,\hat{y})}{\sqrt{Var(y)Var(\hat{y})}}\right)^2 =\frac{Cov(y,\hat{y})^2}{Var(y)Var(\hat{y})}\\ =\frac{Cov(\hat{y}+e,\hat{y})^2}{Var(y)Var(\hat{y})} =\frac{Cov(\hat{y},\hat{y})^2}{Var(y)Var(\hat{y})}\quad (Cov(\hat{y},e)=0)\\ =\frac{Var(\hat{y})^2}{Var(y)Var(\hat{y})} =\frac{Var(\hat{y})}{Var(y)} =\frac{\sum(\hat{y}_i-\overline{\hat{y}})^2}{\sum(y_i-\overline{y})^2}\\ =\frac{SSR}{SST}\\ =R^2 \end{gather} \]
Conclusion: \(R^2\) is equal to the squared Pearson correlation coefficient between the observed response variable \(y\) and the predicted response variable \(\hat{y}\).
We can test this in R by fitting a multiple regression on the built-in iris data:

```r
iris_model <- lm(Petal.Width ~ Sepal.Length + Sepal.Width + Petal.Length, data = iris)
```
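Using the `iris_model` fitted above, we can compare the \(R^2\) reported by `summary()` with the squared correlation between the observed and fitted values (numerical output omitted):

```r
summary(iris_model)$r.squared                # R^2 reported by lm()
cor(iris$Petal.Width, fitted(iris_model))^2  # squared correlation cor(y, y_hat)^2
```

The two values agree, exactly as the derivation above predicts.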
In particular, for simple linear regression the coefficient of determination also equals the square of the sample correlation coefficient \(r_{xy}\) between the predictor \(x\) and the response \(y\).
Assume \(y_i=\beta_0+\beta_1x_i+\epsilon_i\). Then the best linear unbiased estimates of \(\beta_0\) and \(\beta_1\) are \[ \begin{gather*} \hat{\beta_0}=\bar{y}-\hat{\beta_1}\bar{x}\\ \hat{\beta_1}=\frac{S_{xy}}{S_{xx}} \end{gather*} \] where \[ \begin{gather*} S_{xy}=\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})\\ S_{xx}=\sum_{i=1}^n(x_i-\bar{x})^2\\ S_{yy}=\sum_{i=1}^n(y_i-\bar{y})^2 \end{gather*} \] Since \(\hat{y}_i-\bar{y}=\hat{\beta_1}(x_i-\bar{x})\), the explained sum of squares is \(ESS=\hat{\beta_1}^2S_{xx}\), and we obtain the relationship between \(R^2\) and \(r_{xy}\) as \[ \begin{gather*} R^2 = \frac{ESS}{SST}=\frac{\hat{\beta_1}^2S_{xx}}{S_{yy}} =\frac{S_{xy}^2}{S_{xx}S_{yy}}\\ =\left(\frac{S_{xy}}{\sqrt{S_{xx}}\sqrt{S_{yy}}}\right)^2\\ =r^2_{xy} \end{gather*} \] We can also test this by running R code:
```r
model <- lm(Petal.Width ~ Sepal.Length, data = iris)
```
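With the simple regression `model` fitted above, the same comparison can be made against the squared sample correlation between the predictor and the response directly (output omitted):

```r
summary(model)$r.squared                    # R^2 from the simple regression
cor(iris$Petal.Width, iris$Sepal.Length)^2  # squared sample correlation r_xy^2
```

Both lines return the same value, matching the algebra above.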