Terminology in collecting data

Types of statistical studies

  • Study with different purposes

    • Comparative (analytical) study

      A study whose purpose is to compare two or more alternative methods or groups distinguished by some attribute is called a comparative (analytical) study

    • Noncomparative (descriptive) study

      A study whose purpose is to learn about the characteristics of a group, but not necessarily to make comparisons, is called a noncomparative (descriptive) study

  • Study with or without intervention

    • Observational study

      The researcher is a passive observer who documents the process

    • Experimental study

      The researcher actively intervenes to control the study conditions and records the response

    Remark: Both study types can be either comparative or noncomparative

Confounding variable

If the effects of predictor variables cannot be separated from the effects of the uncontrolled variables, those uncontrolled variables are called confounding variables. A variable must meet two conditions to be a confounder:

  • It must be correlated with the independent variable. This may be a causal relationship, but it does not have to be.
  • It must be causally related to the dependent variable

Sometimes the presence of these variables is not recognized until after the fact, in which case they are referred to as lurking variables

graph TD
A[Confounding variable]-.->B[predictor variable];
B -->C[response variable];
A -->C;
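
The structure in the graph can be illustrated with a small simulation. This is only a sketch under assumed numbers: the variable names `z`, `x`, `y`, the effect sizes (1.0 for the predictor, 2.0 for the confounder) and the helper functions are hypothetical choices, not part of the original notes. It shows that ignoring the confounder inflates the estimated effect of the predictor, while removing the confounder's linear effect from both variables recovers it.

```python
import random


def slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def residuals(xs, ys):
    """Residuals of ys after removing the linear effect of xs."""
    b = slope(xs, ys)
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    return [y - my - b * (x - mx) for x, y in zip(xs, ys)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
n = 20_000

# Assumed data-generating model (hypothetical numbers):
z = [rng.gauss(0, 1) for _ in range(n)]        # confounding variable
x = [zi + rng.gauss(0, 1) for zi in z]         # predictor, correlated with z
y = [1.0 * xi + 2.0 * zi + rng.gauss(0, 1)     # response: true effect of x is 1.0
     for xi, zi in zip(x, z)]

naive = slope(x, y)                                 # ignores the confounder
adjusted = slope(residuals(z, x), residuals(z, y))  # confounder's effect removed

print(f"naive slope:    {naive:.2f}")
print(f"adjusted slope: {adjusted:.2f}")
```

Under this assumed model the naive slope is pulled well away from the true effect of 1.0, because part of the confounder's effect on the response is attributed to the predictor.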


Comparative study

The goal of the simplest and most common comparative studies is to evaluate how a "change" from a "baseline" or "normal" condition affects a response variable.

  • The changed condition is called a treatment or an intervention and the normal condition is called a control

  • The treatment and control groups should be similar in all respects except for the changed condition, so that the comparison is valid

Observational study

  • A sample survey provides a snapshot of the population based on a sample observed at a point in time
  • A prospective study follows a sample forward in time
  • A retrospective study follows a sample backward in time

Sample survey

  • A numerical characteristic of a population defined for a specific variable is called a parameter
  • A numerical function of the sample data, called a statistic, is used to estimate the unknown population parameter
  • It is possible to list all units in a finite population; such a list is called a sampling frame
  • It is impossible to list all units of an infinite population as there is no upper limit on the number of units in a population
  • The population of interest is called the target population; sometimes the target population is difficult to study, so sampling is done using another easily accessible population, called the sampled population
  • The deviation between a sample estimate and the true parameter value is called the sampling error. Nonsampling errors often cause bias, which is a systematic deviation between the sample estimate and the population parameter. Bias is a more serious problem than sampling error because it does not disappear as the sample size increases. For example, measurement bias affects all measurements. Other sources of nonsampling error include self-selection bias and response bias
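
The contrast between sampling error and bias can be checked numerically. The sketch below assumes a made-up population (50,000 values drawn from a normal distribution) and a hypothetical constant measurement bias of +2 units; neither comes from the notes. For each sample size it reports the spread of the sample mean around the parameter (sampling error) and the average deviation of the biased estimate.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed for reproducibility

# Hypothetical finite population; its mean is the parameter of interest.
population = [rng.gauss(50, 10) for _ in range(50_000)]
mu = statistics.fmean(population)


def estimate_errors(n, bias=0.0, reps=200):
    """Deviations (estimate - parameter) of the sample mean over repeated SRS draws.

    `bias` is added to every measurement to mimic a miscalibrated instrument.
    """
    errors = []
    for _ in range(reps):
        sample = rng.sample(population, n)          # SRS without replacement
        estimate = statistics.fmean([v + bias for v in sample])
        errors.append(estimate - mu)
    return errors


results = {}
for n in (25, 400, 2500):
    spread = statistics.pstdev(estimate_errors(n))            # sampling error
    mean_dev = statistics.fmean(estimate_errors(n, bias=2.0)) # systematic deviation
    results[n] = (spread, mean_dev)
    print(f"n={n:5d}  spread of unbiased estimate={spread:.3f}  "
          f"mean deviation with biased measurements={mean_dev:.3f}")
```

In this setup the spread shrinks as the sample size grows, while the mean deviation of the biased estimate stays near the assumed bias of 2 regardless of the sample size.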

Basic sampling designs

  • Simple random sample (SRS)

    Draw a sample of size \(n\) from a population of size \(N\) without replacement

  • Stratified random sampling

    Divide a population into homogeneous subpopulations (strata) and draw an SRS from each one

  • Multistage cluster sampling

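    Draw an SRS at each level of a hierarchy, from the top down: first an SRS of the largest units (clusters), then an SRS of subunits within each selected cluster, and so on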

  • Systematic sampling

    A 1-in-\(k\) systematic sample consists of selecting one unit at random from the first \(k\) units and selecting every \(k\)th unit thereafter; it is mainly useful when units arrive in sequence, such as items on an assembly line or cars entering a toll booth on a highway
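
The four designs above can be sketched with Python's standard `random` module. The function names, the toy population of 100 numbered units, and the way strata and clusters are formed are all illustrative assumptions; the cluster sampler shows only two stages.

```python
import random


def simple_random_sample(units, n, rng):
    """SRS: n distinct units drawn without replacement."""
    return rng.sample(units, n)


def stratified_sample(strata, n_per_stratum, rng):
    """Draw an SRS of the requested size from every stratum.

    `strata` maps a stratum name to its list of units.
    """
    return {name: rng.sample(units, n_per_stratum[name])
            for name, units in strata.items()}


def two_stage_cluster_sample(clusters, n_clusters, n_units, rng):
    """Stage 1: SRS of clusters. Stage 2: SRS of units within each chosen cluster."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: rng.sample(clusters[name], n_units) for name in chosen}


def systematic_sample(units, k, rng):
    """1-in-k: random start among the first k units, then every k-th unit."""
    start = rng.randrange(k)
    return units[start::k]


rng = random.Random(42)
population = list(range(1, 101))  # hypothetical units numbered 1..100

srs = simple_random_sample(population, 10, rng)
strat = stratified_sample(
    {"low": population[:50], "high": population[50:]},
    {"low": 5, "high": 5},
    rng,
)
clus = two_stage_cluster_sample(
    {f"c{i}": population[i * 10:(i + 1) * 10] for i in range(10)},
    n_clusters=3, n_units=4, rng=rng,
)
syst = systematic_sample(population, 10, rng)

print("SRS:       ", sorted(srs))
print("Stratified:", strat)
print("Cluster:   ", clus)
print("Systematic:", syst)
```

Passing the random generator explicitly keeps each draw reproducible, which is convenient when a sampling plan has to be documented.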

Experimental studies

  • Treatment factors are the factors controlled in an experiment; their effects on the response variable are of primary interest
  • Nuisance factors, or noise factors, are all the other factors that might affect the response variable
  • The different possible values of a factor are called levels, and each treatment is a particular combination of the levels of different treatment factors
  • Subjects or items receiving treatments are called experimental units; all experimental units receiving the same treatment form a treatment group
  • A run is an observation made on an experimental unit under a particular treatment condition; a replicate is another independent run carried out under identical conditions; repeat measurements of the same response are not replicates

Strategies to reduce experimental error variation

Main components of experimental error:

  • Systematic error

    Caused by differences between experimental units; the nuisance factors on which the experimental units differ are said to confound or bias the treatment comparisons

  • Random error

    Caused by the inherent variability in the response of similar experimental units given the same treatment

  • Measurement error

    Caused by imprecise measurement instruments

  • Strategies to reduce the systematic error, by type of nuisance factor

    • Blocking: used when the nuisance factor is controllable
    • Regression analysis: used when nuisance factors cannot serve as blocking factors because they are not controllable; such nuisance factors are called covariates. A covariate should be measured before the treatments are applied to the experimental units, because the treatment could otherwise affect it too
    • Randomization: the experimental units may well differ on additional known or unknown nuisance factors, which would bias the results. Randomly assigning units to treatments does not make the units equal across treatments, but it ensures that no treatment is systematically favored

    Note: A covariate can be an independent variable (i.e. of direct interest) or an unwanted, confounding variable. Adding a covariate to a model can increase the accuracy of the results

    In summary: Block over those nuisance factors that can be easily controlled, randomize over the rest

  • Strategy to reduce the random error

    Replicate the experimental runs, i.e. make multiple independent runs under identical treatment conditions

  • Strategy to reduce the measurement error

    Take repeat measurements, including having different people make the measurements

    Variation within each group of measurements can be used to estimate random and measurement errors.
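
Why replicates and repeat measurements play different roles can be seen from a simple variance calculation. The sketch below assumes independent, additive random and measurement errors with standard deviations `sigma_random` and `sigma_meas`; this error model and the function name are assumptions made for illustration, not something stated in the notes.

```python
def variance_of_treatment_mean(sigma_random, sigma_meas, replicates, repeats):
    """Variance of a treatment mean under an assumed additive error model.

    Each of `replicates` independent runs contributes one random error
    (sd sigma_random) and is measured `repeats` times, each measurement
    carrying its own measurement error (sd sigma_meas).
    """
    return (sigma_random ** 2 / replicates
            + sigma_meas ** 2 / (replicates * repeats))


# Hypothetical error sizes: random error sd 2.0, measurement error sd 1.0.
single = variance_of_treatment_mean(2.0, 1.0, replicates=1, repeats=1)
many_repeats = variance_of_treatment_mean(2.0, 1.0, replicates=1, repeats=100)
many_replicates = variance_of_treatment_mean(2.0, 1.0, replicates=10, repeats=1)

print("one run, one measurement:     ", single)
print("one run, 100 repeat measures: ", many_repeats)
print("10 replicates, one measure:   ", many_replicates)
```

Under this model, repeating the measurement on a single run shrinks only the measurement-error term, whereas adding replicates shrinks both terms.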

Basic experimental designs

  • Completely randomized design (CRD)

    All experimental units are assigned at random to the treatments

    Batch 1   Batch 2   Batch 3   Batch 4   Batch 5
    A         C         D         D         B
    B         A         A         C         D
    C         D         B         A         C
    B         C         D         B         A

    Randomly assign the \(20\) samples to the \(4\) treatments \(A, B, C, D\) without considering the batches they come from

  • Randomized block design (RBD)

    Form blocks of experimental units that are similar (in terms of the chosen blocking factors)

    Two special cases:

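    • Matched pairs design: subjects are matched in pairs on the (controllable) nuisance factors, and the two treatments are randomly assigned within each pair
    • Cross-over design: each subject receives every treatment in sequence and so serves as its own block; because the order in which the methods are administered may introduce bias due to the learning effect, the order should be randomized

    Batch 1   Batch 2   Batch 3   Batch 4   Batch 5
    A         D         C         D         B
    C         A         B         C         C
    B         C         D         B         A
    D         B         A         A         D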

    Randomly assign the \(4\) treatments to the experimental units within each block (here, each batch)
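
Both layouts can be generated with a few lines of Python. This is a minimal sketch using the 4 treatments and 5 batches from the tables above; the function names are hypothetical, and the CRD version assumes equal replication (5 units per treatment), which matches the example table.

```python
import random
from collections import Counter

TREATMENTS = ["A", "B", "C", "D"]
BATCHES = [f"Batch {i}" for i in range(1, 6)]


def completely_randomized_design(n_units, treatments, rng):
    """CRD: shuffle an equally replicated list of treatment labels over all units."""
    reps, remainder = divmod(n_units, len(treatments))
    if remainder:
        raise ValueError("n_units must be a multiple of the number of treatments")
    labels = list(treatments) * reps
    rng.shuffle(labels)          # batches play no role in the assignment
    return labels


def randomized_block_design(blocks, treatments, rng):
    """RBD: every block receives each treatment once, in an independent random order."""
    return {block: rng.sample(treatments, len(treatments)) for block in blocks}


rng = random.Random(7)  # fixed seed so the layout can be reproduced

crd = completely_randomized_design(20, TREATMENTS, rng)
rbd = randomized_block_design(BATCHES, TREATMENTS, rng)

print("CRD:", crd)
print("CRD counts:", Counter(crd))
for batch, order in rbd.items():
    print(batch, order)
```

In the CRD a batch may by chance receive the same treatment more than once, while the RBD guarantees that each treatment appears exactly once per batch.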