Relationship between R-square and correlation coefficient

The coefficient of determination \(R^2\)

To begin with, we need to define several terms:

  • SST: total sum of squares, the sum of squared differences between the observed values of the dependent variable and their mean \[ \sum(y_i-\overline{y})^2 \]

  • SSR/ESS: sum of squares due to regression, or explained sum of squares, the sum of squared differences between the predicted values and the mean of the dependent variable \[ \sum(\hat{y}_i-\overline{y})^2=\sum(\hat{y}_i-\overline{\hat{y}})^2 \]

  • SSE/RSS: sum of squared errors, or residual sum of squares, the sum of squared differences between the observed and predicted values, i.e., the unexplained variability \[ \sum(y_i-\hat{y}_i)^2=\sum e_i^2 \]

The coefficient of determination \(R^2\) measures how much of the variation of the dependent variable, \(Var(y)\), is explained by the model. Under OLS estimation, \(R^2\) can be calculated as the ratio of the explained sum of squares (ESS) to the total sum of squares (SST) \[ R^2=\frac{ESS}{SST}=1-\frac{SSE}{SST}=1-\frac{\sum e_i^2}{\sum(y_i-\overline{y})^2} \]
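To make the two formulas concrete, here is a minimal numerical sketch (in plain Python rather than the R used later in this post, with made-up illustrative data): fit a least-squares line by hand and confirm that \(ESS/SST\) and \(1-SSE/SST\) give the same value.

```python
# Minimal sketch: compute SST, ESS, SSE by hand for a simple OLS line fit
# and check the two equivalent formulas for R^2. Illustrative data only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# OLS estimates for y = b0 + b1*x
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
     sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

sst = sum((yi - y_bar) ** 2 for yi in y)               # total sum of squares
ess = sum((yh - y_bar) ** 2 for yh in y_hat)           # explained sum of squares
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # residual sum of squares

print(ess / sst)      # R^2 via ESS/SST
print(1 - sse / sst)  # R^2 via 1 - SSE/SST: same value
```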

Formulating \(R^2\)

For an OLS fit with an intercept, write \(y_i=\hat{y}_i+e_i\). Then \[ \begin{gather} Var(y)=Var(\hat{y})+Var(e)+2Cov(\hat{y},e)\\ =Var(\hat{y})+Var(e)\ (Cov(\hat{y},e)=0)\\ \Rightarrow\sum(y_i-\overline{y})^2=\sum(\hat{y}_i-\overline{\hat{y}})^2+\sum(e_i-\overline{e})^2\\ =\sum(\hat{y}_i-\overline{\hat{y}})^2+\sum e_i^2\ (\overline{e}=0)\\ =\sum(\hat{y}_i-\overline{y})^2+\sum e_i^2 \end{gather} \]
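This decomposition can be checked numerically. A small Python sketch with illustrative data, verifying \(SST=SSR+SSE\) along with the two facts used above, \(\overline{e}=0\) and \(Cov(\hat{y},e)=0\):

```python
# Verify SST = SSR + SSE for an OLS fit with an intercept, plus the
# properties the derivation relies on: mean residual ~ 0 and
# Cov(y_hat, e) ~ 0. Illustrative data only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.4, 3.6, 5.1, 5.8]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
     sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]
e = [yi - yh for yi, yh in zip(y, y_hat)]

e_bar = sum(e) / n                        # residuals average to zero
yh_bar = sum(y_hat) / n
cov_yhat_e = sum((yh - yh_bar) * (ei - e_bar)
                 for yh, ei in zip(y_hat, e)) / n  # fitted values orthogonal to residuals

sst = sum((yi - y_bar) ** 2 for yi in y)
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)
sse = sum(ei ** 2 for ei in e)

print(sst, ssr + sse)  # equal up to floating-point error
```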

Relationship between \(R^2\) and Pearson correlation coefficient

\[ \begin{gather} r^2_{y,\hat{y}}=(\frac{Cov(y,\hat{y})}{\sqrt{Var(y)Var(\hat{y})}})^2 =\frac{Cov(y,\hat{y})Cov(y,\hat{y})}{Var(y)Var(\hat{y})}\\ =\frac{Cov(\hat{y}+e,\hat{y})^2}{Var(y)Var(\hat{y})} =\frac{Cov(\hat{y},\hat{y})^2}{Var(y)Var(\hat{y})}\ (Cov(\hat{y},e)=0)\\ =\frac{Var(\hat{y})}{Var(y)} =\frac{\sum(\hat{y}_i-\overline{\hat{y}})^2}{\sum(y_i-\overline{y})^2}\\ =\frac{SSR}{SST}\\ =R^2 \end{gather} \]

Conclusion: \(R^2\) equals the squared Pearson correlation coefficient between the observed response variable \(y\) and the predicted response variable \(\hat{y}\). We can verify this in R:

> iris_model <- lm(Petal.Width~Sepal.Length+Sepal.Width+Petal.Length,data = iris)
> summary(iris_model)

Call:
lm(formula = Petal.Width ~ Sepal.Length + Sepal.Width + Petal.Length,
data = iris)

Residuals:
     Min       1Q   Median       3Q      Max
-0.60959 -0.10134 -0.01089  0.09825  0.60685

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.24031    0.17837  -1.347     0.18
Sepal.Length -0.20727    0.04751  -4.363 2.41e-05 ***
Sepal.Width   0.22283    0.04894   4.553 1.10e-05 ***
Petal.Length  0.52408    0.02449  21.399  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.192 on 146 degrees of freedom
Multiple R-squared: 0.9379, Adjusted R-squared: 0.9366
F-statistic: 734.4 on 3 and 146 DF, p-value: < 2.2e-16

> (cor(iris_model$fitted.values,iris$Petal.Width))^2
[1] 0.9378503

In particular, for simple linear regression, the coefficient of determination also equals the squared sample correlation coefficient \(r_{xy}\) between \(x\) and \(y\).

Assume \(y_i=\beta_0+\beta_1x_i+\epsilon_i\); then the best linear unbiased estimates of \(\beta_0\) and \(\beta_1\) are \[ \begin{gather*} \hat{\beta}_0=\bar{y}-\hat{\beta}_1\bar{x}\\ \hat{\beta}_1=\frac{S_{xy}}{S_{xx}} \end{gather*} \] where \[ \begin{gather*} S_{xy}=\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})\\ S_{xx}=\sum_{i=1}^n(x_i-\bar{x})^2\\ S_{yy}=\sum_{i=1}^n(y_i-\bar{y})^2 \end{gather*} \] Since \(\hat{y}_i-\bar{y}=\hat{\beta}_1(x_i-\bar{x})\), we can easily obtain the relationship between \(R^2\) and \(r_{xy}\) as \[ \begin{gather*} R^2 = \frac{ESS}{SST}=\frac{\hat{\beta}_1^2S_{xx}}{S_{yy}}\\ =\frac{S_{xy}^2}{S_{xx}S_{yy}}\\ =r^2_{xy} \end{gather*} \] We can verify this with R as well:

> model <- lm(Petal.Width ~ Sepal.Length,data = iris)
> summary(model)

Call:
lm(formula = Petal.Width ~ Sepal.Length, data = iris)

Residuals:
     Min       1Q   Median       3Q      Max
-0.96671 -0.35936 -0.01787  0.28388  1.23329

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  -3.20022    0.25689  -12.46   <2e-16 ***
Sepal.Length  0.75292    0.04353   17.30   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.44 on 148 degrees of freedom
Multiple R-squared: 0.669, Adjusted R-squared: 0.6668
F-statistic: 299.2 on 1 and 148 DF, p-value: < 2.2e-16

> cor(model$fitted.values,iris$Petal.Width)^2
[1] 0.6690277
> cor(iris$Petal.Width,iris$Sepal.Length)^2
[1] 0.6690277
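The same identity can also be checked without R; a plain-Python sketch with small illustrative data, computing \(\hat{\beta}_1^2S_{xx}/S_{yy}\) and \(r_{xy}^2\) directly from the definitions above:

```python
# For simple linear regression, check that R^2 = b1^2 * Sxx / Syy
# equals the squared sample correlation r_xy^2. Illustrative data only.
from math import sqrt

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.1, 2.4, 2.8, 4.5, 4.9]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)

b1 = s_xy / s_xx                   # OLS slope
r_squared = b1 ** 2 * s_xx / s_yy  # ESS / SST
r_xy = s_xy / sqrt(s_xx * s_yy)    # sample correlation coefficient

print(r_squared, r_xy ** 2)  # identical up to rounding
```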
