Different methods of generating qqplot

Posted on 2022-11-07 Edited on 2022-11-08 In R

The Q-Q plot, also known as the quantile-quantile plot, is a graphical tool that helps us determine whether a set of data originated from a Normal or exponential distribution. For instance, if a statistical analysis assumes that residuals are normally distributed, a Normal Q-Q plot can be used to verify this assumption. This is merely a visual inspection and not a conclusive proof, so it is somewhat subjective. However, it enables us to see at a glance whether our assumption is plausible, and if not, how the assumption is violated and which data points contribute to the violation. A Q-Q plot is a scatterplot created by superimposing two quantile distributions. If both sets of quantiles came from the same distribution, the points should form a roughly straight line.

This post focuses primarily on introducing functions that can reproduce the Q-Q plot.

qqnorm is a generic function whose default method generates a standard QQ plot of the y values.

1
2
3

> # generate points from normal distribution
> x <- qnorm(seq(0.01,0.99,0.01))+10
> qqnorm(x)

This can be reproduced using the function ppoints, which generates the sequence of probability points, as follows:

1
2
3

> # generate points from normal distribution
> x <- qnorm(seq(0.01,0.99,0.01))+10
> plot(x = qnorm(ppoints(length(x))),y = x)

qqplot could also reproduce the above

1
2
3

> # generate points from normal distribution
> x <- qnorm(seq(0.01,0.99,0.01))+10
> qqplot(x = qnorm(ppoints(length(x))),y = x)

qplot could also reproduce the above

1
2
3

> library(ggplot2)
> x <- qnorm(seq(0.01,0.99,0.01))+10
> qplot(x = qnorm(ppoints(length(x))),y = x)

For residual plot, however, we may choose to plot standardized residuals against theoretical quantiles in Q-Q plot.

We can use the R dataset randu as an illustration.

> str(randu)
'data.frame':	400 obs. of  3 variables:
 $ x: num  0.000031 0.044495 0.82244 0.322291 0.393595 ...
 $ y: num  0.000183 0.155732 0.873416 0.648545 0.826873 ...
 $ z: num  0.000824 0.533939 0.838542 0.990648 0.418881 ...

If dependent variable is scaled, normal residual qqplot could be different

> model <- lm(z ~x+y, data = randu)
> var(resid(model))
[1] 0.07731478
> qqnorm(resid(model), main = 'normal qqplot when z is not scaled')

> model <- lm(10*z~x+y, data =randu)
> var(resid(model))
> qqnorm(resid(model), main = 'normal qqplot when z is scaled')

Clearly, scaling the dependent variable z affects the residual.

Before scaling

> model <- lm(z ~x+y,data = randu)
> summary(model)

Call:
lm(formula = z ~ x + y, data = randu)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.51260 -0.23114 -0.01989  0.22893  0.56084 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.48017    0.03811  12.598   <2e-16 ***
x            0.05424    0.04902   1.106    0.269    
y           -0.05713    0.04757  -1.201    0.230    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2788 on 397 degrees of freedom
Multiple R-squared:  0.007008,	Adjusted R-squared:  0.002006 
F-statistic: 1.401 on 2 and 397 DF,  p-value: 0.2476

After scaling

> model <- lm(10*z~x+y,data =randu)
> summary(model)

Call:
lm(formula = 10 * z ~ x + y, data = randu)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.1260 -2.3114 -0.1989  2.2893  5.6084 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   4.8017     0.3811  12.598   <2e-16 ***
x             0.5424     0.4902   1.106    0.269    
y            -0.5713     0.4757  -1.201    0.230    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.788 on 397 degrees of freedom
Multiple R-squared:  0.007008,	Adjusted R-squared:  0.002006 
F-statistic: 1.401 on 2 and 397 DF,  p-value: 0.2476

Using standardized residuals, the expected value of the residuals is zero, while the variance is (approximately) one. This has two benefits:

If you rescale one of your variables, the residual plots will not change.
The residuals in the qqplot should lie on the line y = x
The residuals in the qqplot should lie on the line y = x You anticipate that 95% of your residuals will fall between -1.96 and 1.96. This makes it simpler to identify outliers.

> model <- lm(z ~x+y,data = randu)
> var(rstandard(model))
[1] 1.002759
> qqnorm(rstandard(model),main = 'standardized residual qqplot before scaling')

> model <- lm(10*z~x+y,data =randu)
> var(rstandard(model))
[1] 1.002759
> qqnorm(rstandard(model), main = 'standardized residual qqplot after scaling')

Reference