Terminologies in collecting data

Posted on 2022-04-25 Edited on 2022-11-07 In statistics

Types of Statistical studies

Study with different purposes
- Comparative (analytical) study
  
  A study whose purpose is to compare two or more alternative methods or groups distinguished by some attribute is called a comparative (analytical) study
- Noncomparative (descriptive) study
  
  A study whose purpose is to learn about the characteristics of a group, but not necessarily to make comparisons, is called a noncomparative (descriptive) study
Study with or without intervention
- Observational study
  
  The researcher is a passive observer who documents the process
- Experimental study
  
  The researcher actively intervenes to control the study conditions and records the response
Remark: Both study types can be either comparative or noncomparative

Confounding variable

If the effects of predictor variables cannot be separated from the effects of the uncontrolled variables, those uncontrolled variables are called confounding variables. A variable must meet two conditions to be a confounder:

- It must be correlated with the independent variable. This may be a causal relationship, but it does not have to be
- It must be causally related to the dependent variable

Sometimes the presence of these variables is not recognized until after the fact, in which case they are referred to as lurking variables

graph TD
A[Confounding variable]-.->B[predictor variable];
B -->C[response variable];
A -->C;
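
The two conditions can be made concrete with a small simulation. The sketch below is only an illustration under assumed numbers (the coefficients, the sample size, and the names `simulate` and `slope` are hypothetical): the response depends on the confounder alone, yet a naive regression on the predictor shows a clear association, which disappears once the confounder's linear effect is removed.

```python
import random
import statistics


def simulate(n=5000, seed=0):
    """Simulate a confounder z that drives both the predictor x and the response y."""
    rng = random.Random(seed)
    xs, ys, zs = [], [], []
    for _ in range(n):
        z = rng.gauss(0, 1)            # confounding variable
        x = z + rng.gauss(0, 1)        # predictor is correlated with z
        y = 2.0 * z + rng.gauss(0, 1)  # response depends on z only, not on x
        xs.append(x)
        ys.append(y)
        zs.append(z)
    return xs, ys, zs


def slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


xs, ys, zs = simulate()

# Naive analysis: ignores z, so it picks up the confounded association.
naive = slope(xs, ys)

# Adjusted analysis: remove the linear effect of z from both x and y,
# then regress the residuals on each other.
bx, by = slope(zs, xs), slope(zs, ys)
rx = [x - bx * z for x, z in zip(xs, zs)]
ry = [y - by * z for y, z in zip(ys, zs)]
adjusted = slope(rx, ry)

print(f"naive slope:    {naive:.2f}")
print(f"adjusted slope: {adjusted:.2f}")
```

With these assumed coefficients the naive slope should land near 1 and the adjusted slope near 0, even though the predictor has no effect on the response at all.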


Comparative study

The goal of the simplest and most common comparative studies is to evaluate how a "change" from a "baseline" or "normal" condition affects a response variable.

The changed condition is called a treatment or an intervention and the normal condition is called a control
The treatment and control groups should be similar in all other respects except for the changed condition to provide a valid comparison

Observational study

A sample survey provides a snapshot of the population based on a sample observed at a point in time
A prospective study follows a sample forward in time
A retrospective study follows a sample backward in time

Sample survey

A numerical characteristic of a population defined for a specific variable is called a parameter
A numerical function of the sample data, called a statistic, is used to estimate the unknown population parameter
It is possible to list all units in a finite population; such a list is called a sampling frame
It is impossible to list all units of an infinite population, as there is no upper limit on the number of units
The population of interest is called the target population; sometimes the target population is difficult to study, so sampling is done using another easily accessible population, called the sampled population
The deviation between a sample estimate and the true parameter value is called the sampling error. Nonsampling errors often cause bias, which is a systematic deviation between the sample estimate and the population parameter. Bias is a more serious problem than sampling error because it does not shrink as the sample size increases. For example, measurement bias affects all measurements. Other sources of nonsampling error include self-selection bias and response bias
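
The difference between sampling error and bias can be sketched with a simulation. Everything here is an assumption made for illustration: the target population is a made-up list of normal values, and the "easily accessible" sampled population is arbitrarily taken to be the units with values above 45. The names `estimate_error` and `mean_abs_error` are hypothetical helpers.

```python
import random
import statistics

rng = random.Random(1)

# Hypothetical target population of 100,000 units.
target = [rng.gauss(50, 10) for _ in range(100_000)]
# Hypothetical sampled population: only the easily reached units.
sampled = [v for v in target if v > 45]

# The parameter of interest: the mean of the target population.
parameter = statistics.fmean(target)


def estimate_error(population, n):
    """Error of the sample mean (the statistic) from one SRS of size n."""
    sample = rng.sample(population, n)  # drawn without replacement
    return statistics.fmean(sample) - parameter


def mean_abs_error(population, n, reps=100):
    """Average absolute error of the estimate over repeated samples."""
    return statistics.fmean([abs(estimate_error(population, n)) for _ in range(reps)])


for n in (25, 400, 2500):
    from_target = mean_abs_error(target, n)
    from_sampled = mean_abs_error(sampled, n)
    print(f"n={n:5d}  target: {from_target:.2f}  sampled: {from_sampled:.2f}")
```

When sampling from the target population the error shrinks as `n` grows, which is sampling error. When sampling from the convenient subpopulation the error settles at a fixed offset no matter how large `n` is, which is bias.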

Basic sampling designs

Simple random sample (SRS)

Draw a sample of size \(n\) from a population of size \(N\) without replacement, so that every possible sample of size \(n\) is equally likely
Stratified Random sampling

Divide a population into homogeneous subpopulations and draw SRS from each one
Multistage cluster sampling

Draw SRS layer by layer from the top to the bottom
Systematic sampling

A 1-in-\(k\) systematic sample consists of selecting one unit at random from the first \(k\) units and selecting every \(k\)th unit thereafter; mainly useful for cases such as items on an assembly line or cars entering a toll booth on a highway
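
The four designs above can be sketched with Python's standard `random` module. This is a minimal illustration, not a survey-grade implementation: the population of 100 numbered units, the strata, the clusters, and all function names are hypothetical.

```python
import random


def simple_random_sample(units, n, rng):
    """SRS: n units drawn without replacement."""
    return rng.sample(units, n)


def stratified_sample(strata, n_per_stratum, rng):
    """Draw an SRS from each stratum (dict: label -> list of units)."""
    return {label: rng.sample(members, n_per_stratum)
            for label, members in strata.items()}


def two_stage_cluster_sample(clusters, n_clusters, n_per_cluster, rng):
    """Stage 1: SRS of clusters. Stage 2: SRS of units within each chosen cluster."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {label: rng.sample(clusters[label], n_per_cluster) for label in chosen}


def systematic_sample(units, k, rng):
    """1-in-k: random start among the first k units, then every k-th unit."""
    start = rng.randrange(k)
    return units[start::k]


rng = random.Random(42)
population = list(range(1, 101))  # hypothetical units numbered 1..100

srs = simple_random_sample(population, 10, rng)
strat = stratified_sample({"low": population[:50], "high": population[50:]}, 5, rng)
clusters = {f"c{i}": population[i * 20:(i + 1) * 20] for i in range(5)}
multi = two_stage_cluster_sample(clusters, 2, 4, rng)
syst = systematic_sample(population, 10, rng)

print("SRS:       ", sorted(srs))
print("Stratified:", strat)
print("Two-stage: ", multi)
print("Systematic:", syst)
```

Only two stages are shown for the cluster design; deeper hierarchies repeat the same step at each layer.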

Experimental studies

Treatment factors are the factors controlled in an experiment; their effects on the response variable are of primary interest
Nuisance factors, or noise factors, are all the other factors that might affect the response variable
The different possible values of a factor are called levels, and each treatment is a particular combination of the levels of different treatment factors
Subjects or items receiving treatments are called experimental units; all experimental units receiving the same treatment form a treatment group
A run is an observation made on an experimental unit under a particular treatment condition; a replicate is another independent run carried out under identical conditions; repeat measurements of the same response are not replicates

Strategies to reduce experimental error variation

Main components of experimental error:

Systematic error

Caused by differences between experimental units; the nuisance factors on which the experimental units differ are said to confound or bias the treatment comparisons

Random error

Caused by the inherent variability in the response of similar experimental units given the same treatment

Measurement error

Caused by imprecise measurement instruments

Strategies to reduce the systematic error with respect to type of nuisance factor
- Blocking: nuisance factor is controllable
- Regression analysis: some nuisance factors cannot be used as blocking factors because they are not controllable; such nuisance factors are called covariates. A covariate should be measured before the treatments are applied to the experimental units, because it could otherwise be affected by the treatment
- Randomization: the experimental units may also differ on additional known or unknown nuisance factors, which would bias the results. Randomization does not make the experimental units equal across treatments, but it ensures that no one treatment is favored
Note: A covariate can be an independent variable (i.e. of direct interest) or it can be an unwanted, confounding variable. Adding a covariate to a model can increase the accuracy of the results

In summary: Block over those nuisance factors that can be easily controlled, randomize over the rest
Strategy to reduce the random error

Replicating the experimental runs; making multiple independent runs under identical treatment conditions
Strategy to reduce the measurement error

Repeat measurements, including having different subjects make the measurements

Variation within each group of measurements can be used to estimate random and measurement errors.
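
One way to see how within-group variation separates the two error sources is a simulated experiment with both replicates and repeat measurements. The sketch below assumes a simple additive model (response = mean + random error + measurement error) with made-up standard deviations; the constants and variable names are hypothetical.

```python
import random
import statistics

rng = random.Random(7)

SIGMA_RANDOM = 2.0       # assumed sd of the random error between similar units
SIGMA_MEASUREMENT = 1.0  # assumed sd of the measuring instrument
N_REPLICATES = 2000      # independent runs under the same treatment
N_REPEATS = 4            # repeat measurements taken on each run

data = []
for _ in range(N_REPLICATES):
    true_response = 10.0 + rng.gauss(0, SIGMA_RANDOM)
    data.append([true_response + rng.gauss(0, SIGMA_MEASUREMENT)
                 for _ in range(N_REPEATS)])

# Variation among repeat measurements of the same run reflects measurement error only.
measurement_var = statistics.fmean([statistics.variance(row) for row in data])

# Variation among run means reflects random error plus averaged measurement error.
run_means = [statistics.fmean(row) for row in data]
between_var = statistics.variance(run_means)
random_var = between_var - measurement_var / N_REPEATS

print(f"estimated measurement variance: {measurement_var:.2f}")
print(f"estimated random variance:      {random_var:.2f}")
```

Under these assumptions the two estimates should come out close to the squared standard deviations used in the simulation. This also illustrates why repeat measurements are not replicates: they carry no information about the random error between units.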

Basic experimental designs

Completely randomized design (CRD)

All experimental units are assigned at random to the treatment

| Batch 1 | Batch 2 | Batch 3 | Batch 4 | Batch 5 |
| --- | --- | --- | --- | --- |
| A | C | D | D | B |
| B | A | A | C | D |
| C | D | B | A | C |
| B | C | D | B | A |

Randomly assign \(20\) samples to the \(4\) treatments \(A, B, C, D\) without considering the batches they come from
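
A completely randomized assignment like this one can be sketched as follows. The function name and the way units are labeled by batch and position are assumptions for illustration; equal replication per treatment is assumed as well.

```python
import random


def completely_randomized_design(units, treatments, rng):
    """Assign treatments to units at random, with equal replication."""
    reps, remainder = divmod(len(units), len(treatments))
    if remainder:
        raise ValueError("number of units must be a multiple of the number of treatments")
    labels = list(treatments) * reps
    rng.shuffle(labels)  # batch membership plays no role in the shuffle
    return dict(zip(units, labels))


rng = random.Random(2022)
# Hypothetical units: 4 samples from each of 5 batches.
units = [(batch, position) for batch in range(1, 6) for position in range(1, 5)]
crd = completely_randomized_design(units, "ABCD", rng)

for batch in range(1, 6):
    print(f"Batch {batch}:", [crd[(batch, p)] for p in range(1, 5)])
```

Because the batch is ignored, a given batch may receive the same treatment more than once and miss another treatment entirely.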
Randomized block design (RBD)
Form blocks of units that are similar in terms of the chosen blocking factors

Two special cases:
- Matched pairs design: Match subjects on the nuisance factors (controllable)
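- Cross-over design: each subject receives every treatment in sequence and so acts as its own block; the order in which the methods are administered should be randomized, because it may introduce bias due to the learning effect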
| Batch 1 | Batch 2 | Batch 3 | Batch 4 | Batch 5 |
| --- | --- | --- | --- | --- |
| A | D | C | D | B |
| C | A | B | C | C |
| B | C | D | B | A |
| D | B | A | A | D |

Randomly assign the \(4\) treatments to the experimental units within each block (batch), so that every treatment appears exactly once per block
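
The within-block randomization can be sketched in a few lines. The function name and block labels below are hypothetical; the sketch assumes each block holds exactly as many units as there are treatments.

```python
import random


def randomized_block_design(blocks, treatments, rng):
    """Shuffle the full set of treatments independently within each block."""
    layout = {}
    for block in blocks:
        order = list(treatments)
        rng.shuffle(order)  # a fresh random order for every block
        layout[block] = order
    return layout


rng = random.Random(2022)
blocks = [f"Batch {i}" for i in range(1, 6)]
rbd = randomized_block_design(blocks, "ABCD", rng)

for block, order in rbd.items():
    print(block, order)
```

Unlike the completely randomized design, every batch here contains each treatment exactly once, so differences between batches cannot bias the treatment comparison.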
