Relationship between R-square and correlation coefficient

The coefficient of determination \(R^2\)

To begin with, we need to define several terms:

  • SST: total sum of squares, the sum of squared differences between the observed values of the dependent variable and their mean \[ \sum(y_i-\overline{y})^2 \]

  • SSR/ESS: sum of squares due to regression, or explained sum of squares, the sum of squared differences between the predicted values and the mean of the dependent variable \[ \sum(\hat{y}_i-\overline{y})^2=\sum(\hat{y}_i-\overline{\hat{y}})^2 \]

  • SSE/RSS: sum of squared errors, or residual sum of squares, the sum of squared differences between the observed and predicted values, i.e., the unexplained variability \[ \sum(y_i-\hat{y}_i)^2=\sum e_i^2 \]

The coefficient of determination \(R^2\) measures how much of the variation of the dependent variable, \(Var(y)\), is explained by the model. Under OLS estimation, \(R^2\) can be calculated as the ratio of the explained sum of squares (ESS) to the total sum of squares (SST) \[ R^2=\frac{ESS}{SST}=1-\frac{SSE}{SST}=1-\frac{\sum e_i^2}{\sum(y_i-\overline{y})^2} \]
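To make the two formulas concrete, here is a minimal numerical sketch (in plain Python rather than the R used later in this post, with made-up illustrative data): fit a least-squares line by hand and confirm that \(ESS/SST\) and \(1-SSE/SST\) give the same value.

```python
# Minimal sketch: compute SST, ESS, SSE by hand for a simple OLS line fit
# and check the two equivalent formulas for R^2. Illustrative data only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# OLS estimates for y = b0 + b1*x
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
     sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

sst = sum((yi - y_bar) ** 2 for yi in y)               # total sum of squares
ess = sum((yh - y_bar) ** 2 for yh in y_hat)           # explained sum of squares
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # residual sum of squares

print(ess / sst)      # R^2 via ESS/SST
print(1 - sse / sst)  # R^2 via 1 - SSE/SST: same value
```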

Formulating \(R^2\)

For an OLS fit with an intercept, write \(y_i=\hat{y}_i+e_i\). Then \[ \begin{gather} Var(y)=Var(\hat{y})+Var(e)+2Cov(\hat{y},e)\\ =Var(\hat{y})+Var(e)\ (Cov(\hat{y},e)=0)\\ \Rightarrow\sum(y_i-\overline{y})^2=\sum(\hat{y}_i-\overline{\hat{y}})^2+\sum(e_i-\overline{e})^2\\ =\sum(\hat{y}_i-\overline{\hat{y}})^2+\sum e_i^2\ (\overline{e}=0)\\ =\sum(\hat{y}_i-\overline{y})^2+\sum e_i^2 \end{gather} \]
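This decomposition can be checked numerically. A small Python sketch with illustrative data, verifying \(SST=SSR+SSE\) along with the two facts used above, \(\overline{e}=0\) and \(Cov(\hat{y},e)=0\):

```python
# Verify SST = SSR + SSE for an OLS fit with an intercept, plus the
# properties the derivation relies on: mean residual ~ 0 and
# Cov(y_hat, e) ~ 0. Illustrative data only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.4, 3.6, 5.1, 5.8]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
     sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]
e = [yi - yh for yi, yh in zip(y, y_hat)]

e_bar = sum(e) / n                        # residuals average to zero
yh_bar = sum(y_hat) / n
cov_yhat_e = sum((yh - yh_bar) * (ei - e_bar)
                 for yh, ei in zip(y_hat, e)) / n  # fitted values orthogonal to residuals

sst = sum((yi - y_bar) ** 2 for yi in y)
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)
sse = sum(ei ** 2 for ei in e)

print(sst, ssr + sse)  # equal up to floating-point error
```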

Relationship between \(R^2\) and Pearson correlation coefficient

\[ \begin{gather} r^2_{y,\hat{y}}=(\frac{Cov(y,\hat{y})}{\sqrt{Var(y)Var(\hat{y})}})^2 =\frac{Cov(y,\hat{y})Cov(y,\hat{y})}{Var(y)Var(\hat{y})}\\ =\frac{Cov(\hat{y}+e,\hat{y})^2}{Var(y)Var(\hat{y})} =\frac{Cov(\hat{y},\hat{y})^2}{Var(y)Var(\hat{y})}\ (Cov(\hat{y},e)=0)\\ =\frac{Var(\hat{y})}{Var(y)} =\frac{\sum(\hat{y}_i-\overline{\hat{y}})^2}{\sum(y_i-\overline{y})^2}\\ =\frac{SSR}{SST}\\ =R^2 \end{gather} \]

Conclusion: \(R^2\) equals the squared Pearson correlation coefficient between the observed response variable \(y\) and the predicted response variable \(\hat{y}\). We can verify this in R:

> iris_model <- lm(Petal.Width~Sepal.Length+Sepal.Width+Petal.Length,data = iris)
> summary(iris_model)

Call:
lm(formula = Petal.Width ~ Sepal.Length + Sepal.Width + Petal.Length,
data = iris)

Residuals:
     Min       1Q   Median       3Q      Max
-0.60959 -0.10134 -0.01089  0.09825  0.60685

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.24031    0.17837  -1.347     0.18
Sepal.Length -0.20727    0.04751  -4.363 2.41e-05 ***
Sepal.Width   0.22283    0.04894   4.553 1.10e-05 ***
Petal.Length  0.52408    0.02449  21.399  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.192 on 146 degrees of freedom
Multiple R-squared: 0.9379, Adjusted R-squared: 0.9366
F-statistic: 734.4 on 3 and 146 DF, p-value: < 2.2e-16

> (cor(iris_model$fitted.values,iris$Petal.Width))^2
[1] 0.9378503

In particular, for simple linear regression, the coefficient of determination also equals the squared sample correlation coefficient \(r_{xy}\) between \(x\) and \(y\).

Assume \(y_i=\beta_0+\beta_1x_i+\epsilon_i\); then the best linear unbiased estimates of \(\beta_0\) and \(\beta_1\) are \[ \begin{gather*} \hat{\beta}_0=\bar{y}-\hat{\beta}_1\bar{x}\\ \hat{\beta}_1=\frac{S_{xy}}{S_{xx}} \end{gather*} \] where \[ \begin{gather*} S_{xy}=\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})\\ S_{xx}=\sum_{i=1}^n(x_i-\bar{x})^2\\ S_{yy}=\sum_{i=1}^n(y_i-\bar{y})^2 \end{gather*} \] Since \(\hat{y}_i-\bar{y}=\hat{\beta}_1(x_i-\bar{x})\), we can easily obtain the relationship between \(R^2\) and \(r_{xy}\) as \[ \begin{gather*} R^2 = \frac{ESS}{SST}=\frac{\hat{\beta}_1^2S_{xx}}{S_{yy}}\\ =\frac{S_{xy}^2}{S_{xx}S_{yy}}\\ =r^2_{xy} \end{gather*} \] We can verify this with R as well:

> model <- lm(Petal.Width ~ Sepal.Length,data = iris)
> summary(model)

Call:
lm(formula = Petal.Width ~ Sepal.Length, data = iris)

Residuals:
     Min       1Q   Median       3Q      Max
-0.96671 -0.35936 -0.01787  0.28388  1.23329

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  -3.20022    0.25689  -12.46   <2e-16 ***
Sepal.Length  0.75292    0.04353   17.30   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.44 on 148 degrees of freedom
Multiple R-squared: 0.669, Adjusted R-squared: 0.6668
F-statistic: 299.2 on 1 and 148 DF, p-value: < 2.2e-16

> cor(model$fitted.values,iris$Petal.Width)^2
[1] 0.6690277
> cor(iris$Petal.Width,iris$Sepal.Length)^2
[1] 0.6690277
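The same identity can also be checked without R; a plain-Python sketch with small illustrative data, computing \(\hat{\beta}_1^2S_{xx}/S_{yy}\) and \(r_{xy}^2\) directly from the definitions above:

```python
# For simple linear regression, check that R^2 = b1^2 * Sxx / Syy
# equals the squared sample correlation r_xy^2. Illustrative data only.
from math import sqrt

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.1, 2.4, 2.8, 4.5, 4.9]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)

b1 = s_xy / s_xx                   # OLS slope
r_squared = b1 ** 2 * s_xx / s_yy  # ESS / SST
r_xy = s_xy / sqrt(s_xx * s_yy)    # sample correlation coefficient

print(r_squared, r_xy ** 2)  # identical up to rounding
```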
