Terminology in collecting data
Types of Statistical studies
Study with different purposes
Comparative (analytical) study
A study whose purpose is to compare two or more alternative methods or groups distinguished by some attribute is called a comparative (analytical) study
Noncomparative (descriptive) study
A study whose purpose is to learn about the characteristics of a group, but not necessarily to make comparisons, is called a noncomparative (descriptive) study
Study with or without intervention
Observational study
The researcher is a passive observer who documents the process
Experimental study
The researcher actively intervenes to control the study conditions and records the response
Remark: Both study types can be either comparative or noncomparative
Confounding variable
If the effects of predictor variables cannot be separated from the effects of the uncontrolled variables, those uncontrolled variables are called confounding variables. A variable must meet two conditions to be a confounder:
- It must be correlated with the independent variable. This may be a causal relationship, but it does not have to be.
- It must be causally related to the dependent variable
Sometimes the presence of these variables is not recognized until after the fact, in which case they are referred to as lurking variables
graph TD
A[Confounding variable]-.->B[predictor variable];
B -->C[response variable];
A -->C;
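The diagram above can be illustrated with a small simulation. This is a minimal sketch under an assumed, hypothetical model (the coefficients, the sample size and the names `slope` and `simulate` are not from the notes): a confounder `z` influences both the predictor `x` and the response `y`, and the true effect of `x` on `y` is 1. Regressing `y` on `x` alone mixes the two paths, so the naive slope comes out near 2 rather than 1 for this particular model.

```python
import random


def slope(x, y):
    """Least-squares slope of y regressed on x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def simulate(n=5000, seed=0):
    """Hypothetical model: z -> x, z -> y, and x -> y with true effect 1.0."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]
    x = [zi + rng.gauss(0, 1) for zi in z]
    y = [1.0 * xi + 2.0 * zi + rng.gauss(0, 1) for xi, zi in zip(x, z)]
    return z, x, y


z, x, y = simulate()

# Ignoring the confounder: the slope absorbs part of the z -> y path.
naive_slope = slope(x, y)

# If z were measured and its (here known) contribution removed,
# the slope would recover the assumed true effect of x.
adjusted_slope = slope(x, [yi - 2.0 * zi for yi, zi in zip(y, z)])

print(f"naive slope:    {naive_slope:.2f}")
print(f"adjusted slope: {adjusted_slope:.2f}")
```

In practice the contribution of the confounder is not known in advance; the subtraction of `2.0 * z` is only possible here because the data were simulated.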
Comparative study
The goal of the simplest and most common comparative studies is to evaluate how a "change" from a "baseline" or "normal" condition affects a response variable.
The changed condition is called a treatment or an intervention and the normal condition is called a control
The treatment and control groups should be similar in all other respects except for the changed condition to provide a valid comparison
Observational study
- A sample survey provides a snapshot of the population based on a sample observed at a point in time
- A prospective study follows a sample forward in time
- A retrospective study follows a sample backward in time
Sample survey
- A numerical characteristic of a population defined for a specific variable is called a parameter
- A numerical function of the sample data, called a statistic, is used to estimate the unknown population parameter
- It is possible to list all units in a finite population; a list of all units in a finite population is called a sampling frame
- It is impossible to list all units of an infinite population as there is no upper limit on the number of units in a population
- The population of interest is called the target population; sometimes the target population is difficult to study, so sampling is done using another easily accessible population, called the sampled population
- The deviation between a sample estimate and the true parameter value is called the sampling error; nonsampling errors often cause bias, which is a systematic deviation between the sample estimate and the population parameter. Bias is a more serious problem than sampling error because it does not disappear as the sample size increases. For example, measurement bias affects all measurements. Other sources of nonsampling error include self-selection bias and response bias
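The difference between sampling error and bias can be sketched with a simulation. Everything in it is an assumption made for illustration: the synthetic population, the constant measurement bias of 2 units, and the function name `survey_errors`. The spread of the estimates around the parameter (sampling error) shrinks as the sample size grows, while the average deviation (bias) stays put.

```python
import random
import statistics


def survey_errors(n, bias=0.0, reps=200, seed=1):
    """Draw `reps` samples of size n and summarize estimate - parameter.

    Returns (mean error, standard deviation of the errors).
    `bias` is a hypothetical constant measurement bias added to every value.
    """
    rng = random.Random(seed)
    # Synthetic finite population; its mean is the parameter of interest.
    population = [rng.gauss(50, 10) for _ in range(20000)]
    parameter = statistics.fmean(population)

    errors = []
    for _ in range(reps):
        sample = rng.sample(population, n)  # SRS without replacement
        statistic = statistics.fmean([value + bias for value in sample])
        errors.append(statistic - parameter)
    return statistics.fmean(errors), statistics.pstdev(errors)


mean_small, spread_small = survey_errors(n=25, bias=2.0)
mean_large, spread_large = survey_errors(n=400, bias=2.0)

print(f"n=25:  mean error {mean_small:.2f}, spread {spread_small:.2f}")
print(f"n=400: mean error {mean_large:.2f}, spread {spread_large:.2f}")
```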
Basic sampling designs
Simple random sample (SRS)
Draw a sample of size \(n\) from a population of size \(N\) without replacement, so that every possible sample of size \(n\) is equally likely
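A minimal sketch of an SRS using Python's standard library; the function name `simple_random_sample` and the numbered sampling frame are assumptions made for illustration. `random.Random.sample` draws without replacement.

```python
import random


def simple_random_sample(frame, n, seed=None):
    """Draw n distinct units from the sampling frame, without replacement."""
    rng = random.Random(seed)
    return rng.sample(list(frame), n)


# Hypothetical sampling frame: units numbered 1..N with N = 100.
frame = range(1, 101)
srs = simple_random_sample(frame, n=10, seed=42)
print(srs)
```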
Stratified random sampling
Divide a population into homogeneous subpopulations (strata) and draw an SRS from each one
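As a sketch, stratified sampling can be written as grouping the units by stratum and drawing an SRS inside each group. The function name `stratified_sample`, the equal per-stratum sample size, and the urban/rural example are assumptions for illustration; real designs often allocate sample sizes in proportion to stratum size.

```python
import random
from collections import defaultdict


def stratified_sample(units, stratum_of, n_per_stratum, seed=None):
    """Group units into strata, then draw an SRS from each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)
    return {
        label: rng.sample(members, min(n_per_stratum, len(members)))
        for label, members in strata.items()
    }


# Hypothetical population: (id, region) pairs.
people = [(i, "rural" if i % 3 == 0 else "urban") for i in range(30)]
by_region = stratified_sample(people, stratum_of=lambda p: p[1],
                              n_per_stratum=4, seed=7)
for region, chosen in by_region.items():
    print(region, chosen)
```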
Multistage cluster sampling
Draw an SRS at each level of a hierarchy of nested groups, from the top level down to the individual units
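A two-stage version can be sketched as follows, assuming the population is given as a mapping from cluster name to its units (the function name `two_stage_sample` and the school/student example are hypothetical). The first stage draws an SRS of clusters, the second draws an SRS of units within each chosen cluster; more stages would repeat the same step on deeper levels.

```python
import random


def two_stage_sample(clusters, n_clusters, n_units, seed=None):
    """Stage 1: SRS of clusters. Stage 2: SRS of units in each chosen cluster."""
    rng = random.Random(seed)
    chosen_clusters = rng.sample(list(clusters), n_clusters)
    return {
        name: rng.sample(clusters[name], min(n_units, len(clusters[name])))
        for name in chosen_clusters
    }


# Hypothetical population: 6 schools with 20 students each.
schools = {
    f"school_{s}": [f"s{s}_student_{i}" for i in range(20)]
    for s in range(6)
}
two_stage = two_stage_sample(schools, n_clusters=2, n_units=5, seed=3)
for school, students in two_stage.items():
    print(school, students)
```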
Systematic sampling
A 1-in-\(k\) systematic sample is obtained by selecting one unit at random from the first \(k\) units and selecting every \(k\)th unit thereafter; mainly useful when units arrive sequentially, such as items on an assembly line or cars entering a toll booth on a highway
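A minimal sketch of 1-in-\(k\) systematic sampling, assuming the units are available as an ordered list (the function name `systematic_sample` and the example of 100 numbered cars are hypothetical). A random start is chosen among the first \(k\) positions and every \(k\)th unit is taken from there.

```python
import random


def systematic_sample(units, k, seed=None):
    """Pick a random start among the first k units, then take every k-th unit."""
    rng = random.Random(seed)
    start = rng.randrange(k)  # index 0 .. k-1
    return units[start::k]


# Hypothetical stream of 100 cars, numbered in order of arrival.
cars = list(range(100))
one_in_ten = systematic_sample(cars, k=10, seed=5)
print(one_in_ten)
```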
Experimental studies
- Treatment factors are the factors controlled in an experiment; their effects on the response variable are of primary interest
- Nuisance factors, or noise factors, are all the other factors that might affect the response variable
- The different possible values of a factor are called levels, and each treatment is a particular combination of the levels of different treatment factors
- Subjects or items receiving treatments are called experimental units; all experimental units receiving the same treatment form a treatment group
- A run is an observation made on an experimental unit under a particular treatment condition; a replicate is another independent run carried out under identical conditions; repeat measurements of the same response are not replicates
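The vocabulary above (factors, levels, treatments, runs, replicates) can be made concrete with a short sketch. The two factors, their levels, and the helper names `enumerate_treatments` and `plan_runs` are hypothetical and chosen only for illustration: each treatment is one combination of levels, and each replicate of a treatment is a separate run.

```python
from itertools import product

# Hypothetical treatment factors and their levels.
factors = {
    "temperature": ["low", "high"],
    "catalyst": ["A", "B", "C"],
}


def enumerate_treatments(factors):
    """Each treatment is one combination of levels, one level per factor."""
    names = list(factors)
    level_lists = [factors[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*level_lists)]


def plan_runs(treatments, replicates):
    """One run per (treatment, replicate) pair; replicates are separate runs."""
    return [
        (index, rep)
        for index in range(len(treatments))
        for rep in range(1, replicates + 1)
    ]


treatments = enumerate_treatments(factors)
runs = plan_runs(treatments, replicates=2)

print(len(treatments), "treatments,", len(runs), "runs")
```

With 2 levels of one factor and 3 of the other there are 2 × 3 = 6 treatments, and 2 replicates of each give 12 runs.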
Strategies to reduce experimental error variation
Main components of experimental error:
Systematic error
Caused by differences between experimental units; the nuisance factors on which the experimental units differ are said to confound or bias the treatment comparisons
Random error
Caused by the inherent variability in the response of similar experimental units given the same treatment
Measurement error
Caused by imprecise measurement instruments
Strategies to reduce the systematic error with respect to type of nuisance factor
- Blocking: used when the nuisance factor is controllable
- Regression analysis: some nuisance factors cannot be used as blocking factors because they are not controllable; such nuisance factors are called covariates. A covariate should be measured before the treatments are applied to the experimental units, because otherwise it could itself be affected by the treatment
- Randomization: the experimental units may also differ on additional known or unknown nuisance factors, which would bias the results. Randomization does not guarantee that the experimental units are equal for each treatment, only that no one treatment is systematically favored
Note: A covariate can be an independent variable (i.e. of direct interest) or it can be an unwanted, confounding variable. Adding a covariate to a model can increase the accuracy of the results
In summary: Block over those nuisance factors that can be easily controlled, randomize over the rest
Strategy to reduce the random error
Replicate the experimental runs, i.e. make multiple independent runs under identical treatment conditions
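As a brief justification, assume the \(r\) replicate responses \(y_1, \dots, y_r\) under one treatment are independent with common random-error variance \(\sigma^2\) (the symbols \(r\), \(y_i\) and \(\sigma^2\) are introduced here for illustration). The variance of the treatment mean then shrinks with the number of replicates:

\[
\operatorname{Var}(\bar{y})
= \operatorname{Var}\!\left(\frac{1}{r}\sum_{i=1}^{r} y_i\right)
= \frac{1}{r^2}\sum_{i=1}^{r}\operatorname{Var}(y_i)
= \frac{\sigma^2}{r}
\]

Quadrupling the number of replicates therefore halves the standard error of the treatment mean.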
Strategy to reduce the measurement error
Repeat measurements, including having different subjects make the measurements.
Variation within each group of measurements can be used to estimate random and measurement errors.
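One way to read the last sentence is sketched below, under assumptions that go beyond the notes: a single treatment, \(r\) replicate runs, and \(m\) repeat measurements per run, with independent normal errors. The names `simulate_runs` and `variance_components` and all numeric values are hypothetical. Variation among repeat measurements of the same run estimates the measurement-error variance; variation among run means estimates the random-error variance plus the measurement-error variance divided by \(m\), so subtracting that share gives the random-error variance.

```python
import random
import statistics


def simulate_runs(r, m, sd_random, sd_measure, seed=0):
    """r replicate runs of one treatment, each measured m times."""
    rng = random.Random(seed)
    data = []
    for _ in range(r):
        true_response = 10.0 + rng.gauss(0, sd_random)  # random error per run
        data.append([true_response + rng.gauss(0, sd_measure) for _ in range(m)])
    return data


def variance_components(data):
    """Estimate (random-error variance, measurement-error variance).

    Assumes every run has the same number m of repeat measurements.
    """
    m = len(data[0])
    # Pooled variance of repeat measurements within each run.
    measurement_var = statistics.fmean([statistics.variance(run) for run in data])
    # Variance of run means = random variance + measurement variance / m.
    run_means = [statistics.fmean(run) for run in data]
    between_var = statistics.variance(run_means)
    random_var = max(between_var - measurement_var / m, 0.0)
    return random_var, measurement_var


data = simulate_runs(r=2000, m=4, sd_random=2.0, sd_measure=1.0)
random_var, measurement_var = variance_components(data)
print(f"random-error variance      ~ {random_var:.2f}")
print(f"measurement-error variance ~ {measurement_var:.2f}")
```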
Basic experimental designs
Completely randomized design (CRD)
All experimental units are assigned at random to the treatments
| Batch 1 | Batch 2 | Batch 3 | Batch 4 | Batch 5 |
|---|---|---|---|---|
| A | C | D | D | B |
| B | A | A | C | D |
| C | D | B | A | C |
| B | C | D | B | A |

Randomly assign \(20\) samples to the \(4\) treatments \(A, B, C, D\) without considering the batches they are from
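A CRD assignment like the one above can be sketched by shuffling a balanced list of treatment labels over all units. The function name `completely_randomized_design` and the equal group sizes are assumptions for illustration; batch membership is deliberately ignored.

```python
import random
from collections import Counter


def completely_randomized_design(units, treatments, seed=None):
    """Assign treatments to units at random, with equal group sizes."""
    if len(units) % len(treatments) != 0:
        raise ValueError("number of units must be a multiple of the number of treatments")
    rng = random.Random(seed)
    labels = list(treatments) * (len(units) // len(treatments))
    rng.shuffle(labels)  # random order, ignoring any batch structure
    return dict(zip(units, labels))


# 20 samples (4 from each of 5 batches), identified only by a running number.
samples = list(range(1, 21))
crd = completely_randomized_design(samples, "ABCD", seed=11)
print(crd)
print(Counter(crd.values()))
```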
Randomized block design (RBD)
Form blocks of units that are similar (in terms of the chosen blocking factors), then assign treatments at random within each block
Two special cases:
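- Matched pairs design: subjects are matched in pairs on the (controllable) nuisance factors, and the two treatments are randomly assigned within each pair
- Cross-over design: each subject receives all treatments in sequence and so serves as its own block; the order in which the methods are administered may introduce bias due to the learning effect, so the order should be randomized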
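| Batch 1 | Batch 2 | Batch 3 | Batch 4 | Batch 5 |
|---|---|---|---|---|
| A | D | C | D | B |
| C | A | B | C | C |
| B | C | D | B | A |
| D | B | A | A | D |

Randomly assign the \(4\) treatments to the experimental units within each block (batch), so that every treatment appears once per block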
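The RBD assignment above can be sketched as an independent random permutation of the treatments inside each block. The function name `randomized_block_design` and the batch labels are assumptions for illustration; each block is assumed to contain exactly as many units as there are treatments.

```python
import random


def randomized_block_design(blocks, treatments, seed=None):
    """Within each block, assign every treatment once in random order."""
    rng = random.Random(seed)
    design = {}
    for block in blocks:
        order = list(treatments)
        rng.shuffle(order)  # independent permutation per block
        design[block] = order
    return design


batches = [f"Batch {i}" for i in range(1, 6)]
rbd = randomized_block_design(batches, "ABCD", seed=11)
for batch, order in rbd.items():
    print(batch, order)
```

Compared with the CRD, differences between batches can no longer favor one treatment, because every treatment is observed once in every batch.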