Different methods of generating qqplot

The Q-Q plot, also known as the quantile-quantile plot, is a graphical tool that helps us determine whether a set of data originated from a Normal or exponential distribution. For instance, if a statistical analysis assumes that residuals are normally distributed, a Normal Q-Q plot can be used to verify this assumption. This is merely a visual inspection and not a conclusive proof, so it is somewhat subjective. However, it enables us to see at a glance whether our assumption is plausible, and if not, how the assumption is violated and which data points contribute to the violation. A Q-Q plot is a scatterplot created by superimposing two quantile distributions. If both sets of quantiles came from the same distribution, the points should form a roughly straight line.

This post focuses primarily on introducing functions that can reproduce the Q-Q plot.

qqnorm is a generic function whose default method generates a standard QQ plot of the y values.

1
2
3
> # generate points from normal distribution
> x <- qnorm(seq(0.01,0.99,0.01))+10
> qqnorm(x)

hi

This can be reproduced using the function ppoints, which generates the sequence of probability points, as follows:

1
2
3
> # generate points from normal distribution
> x <- qnorm(seq(0.01,0.99,0.01))+10
> plot(x = qnorm(ppoints(length(x))),y = x)

qqplot could also reproduce the above

1
2
3
> # generate points from normal distribution
> x <- qnorm(seq(0.01,0.99,0.01))+10
> qqplot(x = qnorm(ppoints(length(x))),y = x)

qplot could also reproduce the above

1
2
3
> library(ggplot2)
> x <- qnorm(seq(0.01,0.99,0.01))+10
> qplot(x = qnorm(ppoints(length(x))),y = x)

For residual plot, however, we may choose to plot standardized residuals against theoretical quantiles in Q-Q plot.

We can use the R dataset randu as an illustration.

1
2
3
4
5
> str(randu)
'data.frame': 400 obs. of 3 variables:
$ x: num 0.000031 0.044495 0.82244 0.322291 0.393595 ...
$ y: num 0.000183 0.155732 0.873416 0.648545 0.826873 ...
$ z: num 0.000824 0.533939 0.838542 0.990648 0.418881 ...
  • If dependent variable is scaled, normal residual qqplot could be different
1
2
3
4
5
6
7
8
> model <- lm(z ~x+y, data = randu)
> var(resid(model))
[1] 0.07731478
> qqnorm(resid(model), main = 'normal qqplot when z is not scaled')

> model <- lm(10*z~x+y, data =randu)
> var(resid(model))
> qqnorm(resid(model), main = 'normal qqplot when z is scaled')

Clearly, scaling the dependent variable z affects the residual.

Before scaling

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
> model <- lm(z ~x+y,data = randu)
> summary(model)

Call:
lm(formula = z ~ x + y, data = randu)

Residuals:
Min 1Q Median 3Q Max
-0.51260 -0.23114 -0.01989 0.22893 0.56084

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.48017 0.03811 12.598 <2e-16 ***
x 0.05424 0.04902 1.106 0.269
y -0.05713 0.04757 -1.201 0.230
---
Signif. codes: 0***0.001**0.01*0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2788 on 397 degrees of freedom
Multiple R-squared: 0.007008, Adjusted R-squared: 0.002006
F-statistic: 1.401 on 2 and 397 DF, p-value: 0.2476

After scaling

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
> model <- lm(10*z~x+y,data =randu)
> summary(model)

Call:
lm(formula = 10 * z ~ x + y, data = randu)

Residuals:
Min 1Q Median 3Q Max
-5.1260 -2.3114 -0.1989 2.2893 5.6084

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.8017 0.3811 12.598 <2e-16 ***
x 0.5424 0.4902 1.106 0.269
y -0.5713 0.4757 -1.201 0.230
---
Signif. codes: 0***0.001**0.01*0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.788 on 397 degrees of freedom
Multiple R-squared: 0.007008, Adjusted R-squared: 0.002006
F-statistic: 1.401 on 2 and 397 DF, p-value: 0.2476

Using standardized residuals, the expected value of the residuals is zero, while the variance is (approximately) one. This has two benefits:

  1. If you rescale one of your variables, the residual plots will not change.
  2. The residuals in the qqplot should lie on the line y = x
  3. The residuals in the qqplot should lie on the line y = x You anticipate that 95% of your residuals will fall between -1.96 and 1.96. This makes it simpler to identify outliers.
1
2
3
4
5
6
7
8
9
> model <- lm(z ~x+y,data = randu)
> var(rstandard(model))
[1] 1.002759
> qqnorm(rstandard(model),main = 'standardized residual qqplot before scaling')

> model <- lm(10*z~x+y,data =randu)
> var(rstandard(model))
[1] 1.002759
> qqnorm(rstandard(model), main = 'standardized residual qqplot after scaling')

Reference