Descriptive Statistics and Distributions


Lesson 15 of 20

Notes

Descriptive statistics summarise and communicate the key features of a dataset. Understanding the distributions of data underpins all subsequent inferential statistical analysis.

Study designs are classified as descriptive (surveys that describe who, what, when, and where) or analytic (testing hypotheses about causal relationships, for example RCTs, cohort studies, and case-control studies). In analytic studies, replication (multiple measurements across sites or participants) allows true effects to be separated from chance variation. Control groups establish what would happen without the intervention or exposure.

Confounding is the distortion of a relationship between exposure and outcome by a third variable. A confounder must be: associated with the exposure (independently of the outcome); associated with the outcome (independently of the exposure); and not on the causal pathway. Confounding is a major threat to internal validity in observational analytic studies.
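As a rough illustration of how a confounder can distort an association, the sketch below compares a crude risk ratio with stratum-specific risk ratios. All counts are hypothetical, and the risk ratio (risk in the exposed divided by risk in the unexposed) is used here only as a convenient measure of association; the stratum labels "younger" and "older" stand in for any third variable that is linked to both exposure and outcome and is assumed not to lie on the causal pathway.

```python
# Hypothetical (cases, total) counts per exposure group within each stratum
# of a third variable. These numbers are invented for illustration only.
strata = {
    "younger": {"exposed": (10, 100), "unexposed": (40, 400)},
    "older": {"exposed": (120, 400), "unexposed": (30, 100)},
}


def risk_ratio(exposed, unexposed):
    """Risk in the exposed group divided by risk in the unexposed group."""
    (cases_e, total_e), (cases_u, total_u) = exposed, unexposed
    return (cases_e / total_e) / (cases_u / total_u)


def pooled(group):
    """Add up cases and totals for one exposure group across all strata."""
    cases = sum(s[group][0] for s in strata.values())
    total = sum(s[group][1] for s in strata.values())
    return cases, total


# Within each stratum the exposed and unexposed groups have the same risk.
stratum_rr = {
    name: risk_ratio(groups["exposed"], groups["unexposed"])
    for name, groups in strata.items()
}

# Ignoring the third variable pools the strata and gives the crude estimate.
crude_rr = risk_ratio(pooled("exposed"), pooled("unexposed"))

print("stratum-specific risk ratios:", stratum_rr)
print(f"crude risk ratio: {crude_rr:.2f}")
```

In this made-up dataset the exposure has no association with the outcome within either stratum, yet the crude estimate suggests one, because the older stratum is both more often exposed and at higher risk.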

Data types determine which statistical methods are appropriate. Continuous variables can take any value within a range (e.g., blood pressure, height). Discrete variables are counted whole numbers (e.g., number of hospital admissions). Categorical variables are descriptive: nominal (no natural ordering, e.g., blood group, supermarket preference) or ordinal (natural ordering, e.g., pain scale, socioeconomic deprivation). Binary variables have exactly two categories (yes/no or 0/1 data, also called dichotomous data). Censored data arise when the true value is only partially known: right-censored (true value is larger than recorded, e.g., still alive at study end), left-censored (true value is smaller than recorded), or interval-censored (true value lies between two known bounds).

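Statistical measures: a ratio is x/y, where x and y have the same units; a rate is a ratio whose numerator and denominator have different units (e.g., cases per person-year); a proportion is a fraction of a whole, where the numerator is part of the denominator.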
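The three measures above can be sketched with a few lines of arithmetic. The counts and variable names below are hypothetical and chosen only to make the distinction concrete.

```python
# Hypothetical counts, for illustration only.
men, women = 120, 80                 # people in a study sample
new_cases, person_years = 15, 3000   # events and total follow-up time

ratio = men / women                  # same units: people / people
proportion = men / (men + women)     # part divided by the whole, always in [0, 1]
rate = new_cases / person_years      # different units: cases per person-year

print(f"ratio of men to women: {ratio:.2f}")
print(f"proportion of men: {proportion:.2f}")
print(f"rate: {rate * 1000:.1f} cases per 1000 person-years")
```

Note that a proportion can never exceed 1, whereas a ratio or a rate can.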

Probability fundamentals: the probabilities of complementary events sum to 1; mutually exclusive events cannot both occur; the conditional probability P(A|B) is the probability of A given that B has occurred; independent events have P(B|A) = P(B).

The binomial distribution models discrete binary outcome data (e.g., disease present/absent); it requires a fixed number of trials n, binary outcomes, a constant probability π, and independent trials. The normal distribution models continuous variables; it is symmetric and bell-shaped, and is defined by its mean µ and variance σ². The standard normal distribution Z has µ = 0 and σ² = 1. The log-normal distribution is positively skewed; taking the natural log of log-normal data transforms it to normality.

The 95% reference range contains the central 95% of all population values: µ ± 1.96σ for a normal distribution. Relative frequency (the proportion in each category) approaches the true probability as sample size increases.

Sensitivity = P(positive test | disease present); specificity = P(negative test | disease absent); PPV = P(disease present | positive test); NPV = P(disease absent | negative test). Prevalence = frequency of existing cases at a given time; incidence = frequency of new cases over a given period.
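The binomial probability, the 95% reference range, and the four diagnostic test measures above can be sketched in Python using only the standard library. The function names and every input value here are hypothetical and chosen for illustration; the binomial formula used is the standard one, P(k) = C(n, k) π^k (1 − π)^(n − k).

```python
from math import comb
from statistics import NormalDist


def binomial_pmf(k, n, pi):
    """P(exactly k events in n independent trials, each with probability pi)."""
    return comb(n, k) * pi**k * (1 - pi) ** (n - k)


def reference_range_95(mu, sigma):
    """Central 95% of a normal population: mu ± 1.96 * sigma."""
    return mu - 1.96 * sigma, mu + 1.96 * sigma


def diagnostic_measures(tp, fp, fn, tn):
    """Sensitivity, specificity, PPV and NPV from the counts of a 2x2 table."""
    return {
        "sensitivity": tp / (tp + fn),  # P(positive test | disease present)
        "specificity": tn / (tn + fp),  # P(negative test | disease absent)
        "ppv": tp / (tp + fp),          # P(disease present | positive test)
        "npv": tn / (tn + fn),          # P(disease absent | negative test)
    }


# Hypothetical inputs, for illustration only.
p_two_cases = binomial_pmf(k=2, n=10, pi=0.1)
low, high = reference_range_95(mu=120, sigma=10)

# Check that ±1.96 covers about 95% of the standard normal distribution Z.
z = NormalDist()  # defaults to mean 0, standard deviation 1
central_area = z.cdf(1.96) - z.cdf(-1.96)

measures = diagnostic_measures(tp=90, fp=50, fn=10, tn=850)

print(f"P(2 cases in 10 trials): {p_two_cases:.4f}")
print(f"95% reference range: {low:.1f} to {high:.1f}")
print(f"area within ±1.96 of Z: {central_area:.4f}")
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts, sensitivity and specificity are both high while PPV is noticeably lower, which reflects that PPV and NPV depend on how common the disease is in the tested group, not only on the test itself.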
